COMPOSITIONS AND METHODS FOR INFERRING ANCESTRY

BACKGROUND OF THE INVENTION 
FIELD OF THE INVENTION

[0001] The invention relates generally to the identification of genetic markers predictive 
of an individual's biogeographical ancestry, and more specifically to combinations of single
nucleotide polymorphisms useful as ancestry informative markers (AIMs), which allow an
inference as to a trait of an individual, algorithms for identifying such AIMs, and methods of 
using such AIMs to infer a trait of an individual, including an individual's ancestry, 
responsiveness of an individual to a drug, and predisposition of an individual to a disease. 

BACKGROUND INFORMATION 

[0002] The majority (80-90%) of the genetic variation among human individuals is inter- 
individual, and only a relatively small proportion (10-20%) is due to population differences 
(Nei, In Molecular Population Genetics (Columbia University Press, New York) 1987; 
Cavalli-Sforza et al., In The History and Geography of Human Genes (Princeton University
Press, Princeton NJ) 1994; Deka et al., Electrophoresis 16:1659-1664, 1995; Rosenberg et
al., Science 298:2381-2385, 2002; Akey et al., BioTechniques 30:348-367, 2001; Akey et al.,
Hum. Genet. 108:516-520, 2002). Most populations share alleles and those alleles that are 
most frequent in one population are also frequent in others. There are very few classical 
markers (e.g., blood group, serum protein, and immunological markers) or DNA genetic 
markers that are population-specific or have large frequency differentials among 
geographically and ethnically defined populations (Roychoudhury and Nei, In Human
Polymorphic Genes: World Distribution (Oxford University Press, New York) 1988; Dean et 
al., Amer. J. Hum. Genet. 55:788-808, 1994; Cavalli-Sforza et al., supra, 1994, Akey et al., 
supra, 2001, 2002). Despite this apparent lack of unique genetic markers, there are marked 
physical and physiological differences among human populations that presumably reflect 
genetic adaptation to unique ecological conditions, random genetic drift, and sexual selection.
In contemporary populations, these differences are evident in morphological differences
between ethnic groups, as well as in differences in drug responsiveness and in susceptibility 
and resistance to disease. 

[0003] On a basic level, human population structure can be represented in terms of
BioGeographical Ancestry (BGA), which is the heritable component of "race" or heritage, 
and which is relevant on any scale of resolution. For example, on a crude level, BGA can be 
determined for 2 groups (e.g., European vs. others); or on a fine level, e.g., it can refer to 
"race" in terms of 4 groups such as IndoEuropeans, East Asians, sub-Saharan African and 
Native American; or on a finer level, e.g., it can refer to ethnicity within the European group 
(for example, Mediterranean or Scandinavian); or on a still finer level, e.g., it can even refer 
to groups of families within ethnic groups, such as groups of O'Reillys descended from a set
of common ancestors within the Irish group. The measurement of BGA is relevant for almost
any type of genetics or epidemiological study design. For example, BGA is an important
component in the variability of drug response (Burroughs et al., J. Natl. Med. Assoc. 94:1-26,
2002). The reason for this relationship is that genetic drift, geographical and/or reproductive 
isolation, and regional selective pressures have molded the allele frequencies of our ancestors 
for compatibility with alkaloids, tannins (self-defense chemicals), and other xenobiotics 
found in indigenous diets. Most drugs are derived from such chemicals and, therefore, it is 
no coincidence that the family of enzymes that allow humans to detoxify drugs are found at 
different frequencies in different populations. This scenario is not unique to drug 
responsiveness, and many other parts of the genome that are unrelated to drug responsiveness 
are subject to these same types of pressures. 

[0004] Investigators generally have been concerned with identifying gene variants that 
cause a disease (the so-called "phenotypically active" loci), rather than identifying gene
variants that are simply correlated with disease. As such, whatever the trait being examined, 
and for most study designs involving unrelated individuals, it has been considered important 
to control for population structure so as to avoid identifying markers of structure that 
correlate with trait value in a given sample rather than those in linkage disequilibrium (LD) 
with phenotypically active loci (Risch et al., Genome Biology 3:1-12, 2001; Wang et al.,
Amer. J. Hum. Genet. 71:1227-1234, 2002; Burroughs et al., supra, 2002; Rao and
Chakraborty, Amer. J. Hum. Genet. 26:444-453, 1974). There are two sources of population 
structure in a sample collection: 1) sampling effects, which can create structure even if 
sampling is performed from homogeneous populations, and 2) natural human demography. 
The first source of population structure is a nuisance for genetics studies, and associations
found from a study due to this type of structure are generally considered an artifact of the 
collection process rather than a reflection of human demography. Most geneticists generally 
consider the second kind of structure to be a nuisance as well. As such, associations 
identified as being due to population structure have been considered spurious findings or 
artifacts, and have generally been discarded; only findings due to true linkage or LD have 
been published, as such markers are considered linked to biologically relevant genes. 

[0005] Much effort has been directed to quantifying both types of population structure 
(above) in groups of individuals. Such methods essentially measure the departure from 
expected levels of heterozygosity within a group of samples as an indication of structure 
(though none of these methods are capable of reading within-individual structure). Many 
common diseases exhibit locus and/or allelic heterogeneity as a function of BGA, and many
authors have suggested that inadequate attention to population structure during the study
design step has produced at least some of the so-called "false positive" results implicated in
the rash of irreproducible Common Disease/Common Variant results obtained to date
(Terwilliger et al., Curr. Opin. Genet. Devel. 12:726-734, 2002). In order to control for the 
influence of population structure, several tests are appropriate (Cockerham, Evolution 
23:72-83, 1969; Cockerham, Genetics 74:679-700, 1973; Weir and Cockerham, Evolution
38:1358-1370, 1984; Long, Genetics 112:629-647, 1986; Excoffier et al., Genetics
131:343-359, 1992). These methods can be grouped in two main categories: genomic
control methods (Devlin and Roeder, Biometrics 55:997-1004, 1999), and structured
association (SA) methods (Pritchard and Donnelly, Theor. Popul. Biol. 60:227-237, 2001).
Both methods require genotyping of a panel of unlinked markers to estimate and correct for 
the effect of genetic structure, but they are usually applied to sample collections as a whole. However,
should a pool of samples fail such a test, it is usually not clear which samples should be
eliminated to rectify the problem. An equally vexing problem with these methods is that they are
often applied to a study sample after the creation of expensive data, thus creating a circular
logic problem in addition to an economic problem; these methods are usually employed to 
extract information on population structure using the characteristics of the data within which 
associations are sought. 
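
By way of illustration only, the heterozygosity-based measures of structure referred to above can be sketched with Wright's fixation index (F_ST) at a single biallelic locus. The function name, the equal weighting of subpopulations, and the example frequencies used in testing are assumptions made for this sketch and are not taken from the cited references.

```python
def fst_biallelic(freqs):
    """Wright's F_ST at one biallelic locus.

    freqs: allele frequency of the same allele in each subpopulation.
    Assumption for this sketch: subpopulations are weighted equally.
    """
    if not freqs:
        raise ValueError("need at least one subpopulation frequency")
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity if the pooled sample were one random-mating group.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within the subpopulations.
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    if h_total == 0.0:
        return 0.0  # monomorphic locus carries no information on structure
    return (h_total - h_within) / h_total
```

A value near 0 indicates little departure from the heterozygosity expected in an unstructured sample, whereas a value near 1 indicates that different alleles are close to fixation in different subpopulations.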

[0006] In order to minimize the influence of population structure from the outset of an
effort to identify phenotypically active loci, where structure or admixture is not to be used as 
a statistical fuel, it is generally desirable to qualify samples based on crude population 
stratifications such as BGA so that cases and controls can be matched and homogenized in 
composition. For example, it is not uncommon in the execution of case-control studies to 
ensure equal proportions or "racial homogeneity" within and between cases and controls. 
However, for most research purposes, the subjective methods used to measure population 
affiliation are unsatisfactory. As currently measured using biographical questionnaires, little 
knowledge of population structure other than the obvious is obtained, and only basic 
connections between population structure and drug response can be apparent and/or 
controlled. Consistency is a significant problem with the self-reporting of race on 
questionnaires, and one that the Food and Drug Administration is attempting to address 
during the clinical trial design process. However, using such subjective and imprecise 
methods of data collection, consistency can be a difficult end to achieve. 

[0007] Rather than reformulating how questions are asked on questionnaires, consistency 
can be better addressed by replacing the subjective nature of the exercise with objective, 
reproducible scientific methods. Standardization and objectivity are of paramount importance
for the collection of race data because its measure can be as subjective as that of any other 
human attribute. The self-reporting of race is not as trivial an exercise as the self-reporting of 
gender, and many people do not know their race or are of sufficient admixture that they have 
trouble classifying themselves into a single group. Such a scenario is particularly common in 
countries such as the United States, in which numerous cultures have been combined due to 
immigration. For example, a woman of mainly sub-Saharan African descent, raised in Puerto
Rico, may describe herself as Hispanic. Though she socio-culturally identifies with
Hispanics, her xenobiotic metabolism and drug target polymorphisms may more
likely be associated with those shared among other sub-Saharans. By using nonanthropologic
designations that describe the socio-cultural construct of society, current guidelines for 
considering information on race in the study design process can effect poor predictive power 
and false positive results. Where a person was raised and lives, and the cultural or 
sociological customs they observe, may have an impact on how that person responds to a 
drug or on that person's proclivity to develop a disease. Thus, non-biological metrics are required, but the
evidence suggests that BGA also has an impact and, therefore, needs to be measured in a 
scientifically accurate and reproducible manner. 

[0008] Genetic markers present in a person's DNA provide the best opportunity to 
reliably determine the BGA of an individual, and it has long been recognized that such a means
is possible. For example, Reed (Science 244:575-576, 1973) and Neel (Mutat. Res.
26:319-328, 1974) referred to such markers as "private", and used them to estimate mutation
rates. Reed (supra, 1973) used the term "ideal" (in reference to the utility of the markers in
individual ancestry estimation) to describe hypothetical genetic marker loci at which different
alleles are fixed in different populations. Chakraborty et al. (J. Ethnic. Dis. 1:245-256, 1991)
referred to variants that are found in only one population as "unique alleles", and showed how 
allele frequencies could be inverted to provide a likelihood estimate of population, or BGA 
affiliation. The most useful "unique alleles" for the inference of BGA are those that also have 
large differences in allele frequency among populations (Reed, supra, 1973; Chakraborty et 
al., Genetics 130:231-243, 1992; Stephens et al., Amer. J. Hum. Genet. 55:809-824, 1994), 
and that have been referred to as "population-specific alleles" (PSAs, Shriver et al., Amer. J. 
Hum. Genet. 60:957-964, 1997; Parra et al., Amer. J. Hum. Genet. 63:1839-1851, 1998), but 
which are now referred to as "Ancestry Informative Markers" (AIMs; Shriver et al., Hum. 
Genet. 112:387-399, 2003; Frudakis et al., J. Forens. Sci. 48(4):771-782, 2003).
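
As a hedged illustration of how allele frequencies can be turned into a likelihood of population affiliation, the following sketch scores a multilocus genotype against the allele frequencies of a candidate population, and computes the allele frequency differential between two populations. It assumes Hardy-Weinberg proportions within each population and independent (unlinked) loci; the function names and the coding of genotypes as allele counts are choices made for this example, not the procedure of the cited authors.

```python
import math


def genotype_log_likelihood(genotypes, freqs):
    """Log-likelihood of a multilocus genotype given one population.

    genotypes: per-locus count (0, 1, or 2) of the scored allele.
    freqs: per-locus frequency of that allele in the candidate population.
    Assumptions: Hardy-Weinberg proportions and independent loci.
    """
    total = 0.0
    for count, p in zip(genotypes, freqs):
        probs = {0: (1.0 - p) ** 2, 1: 2.0 * p * (1.0 - p), 2: p * p}
        prob = probs[count]  # KeyError flags an invalid allele count
        if prob == 0.0:
            return float("-inf")  # genotype impossible in this population
        total += math.log(prob)
    return total


def allele_frequency_differential(p_a, p_b):
    """Absolute difference in allele frequency between two populations."""
    return abs(p_a - p_b)
```

Comparing the log-likelihoods obtained for two or more candidate populations indicates which affiliation is best supported by the genotype; loci with a larger differential contribute more to separating the populations.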

[0009] Within the field of forensics, statistical methods that use simple tandem repeats 
(STRs) to infer the highest level of ancestry in a particular individual (majority BGA using 
proportional ancestry notation) can be fairly robust in terms of estimating majority BGA 
affiliation. Although STR tests can effectively resolve majority ancestral origin in most 
cases, an unacceptable number (5-10%) of classifications are ambiguous. Aside from 
sampling errors caused by rare alleles, and the fact that STRs were not selected from the 
genome for their ability to resolve population affiliation (i.e., STR allele frequency 
differentials are not necessarily nor optimally informative for this purpose), the major reason
for the high level of ambiguity likely is admixture, which is clearly a factor in the genetic
variation of many human populations (Parra et al., supra, 1998; Cavalli-Sforza and Bodmer,
In The Genetics of Human Populations (Dover Publications, NY; see pages 387-507) 1999;
Rosenberg et al., supra, 2002). For a given study design, whether using self-reported 
information or DNA marker testing, and whether attempting to solve a pharmacogenomic or 
forensic problem, classifying a patient into a single group sacrifices the subtle, but not 
insignificant, information related to population structure and sub-structure; for example, there 
is no allowance to assign a person of 50% African and 50% European affiliation into a group. 
Unfortunately, markers and methods for allowing an accurate inference as to the BGA for 
more than just two groups at a time for an individual have not yet been described. Thus, a 
need exists for robust markers useful for inferring BGA, and for methods of identifying and 
using such markers. The present invention satisfies this need, and provides additional 
advantages. 

SUMMARY OF THE INVENTION 

[0010] The present invention provides methods and compositions for measuring, with a 
desired predetermined level of confidence, within-individual population structure, which, as
disclosed herein, allows inferences to be drawn, for example, as to ancestry, pigmentation 
traits, drug responsiveness, and disease susceptibility of the individual. By way of example, 
the present methods and compositions were used in a forensics capacity, wherein DNA 
samples obtained at the crime scenes of a serial murderer/rapist in Louisiana were examined.
Based on psychological profiling, police were of the belief that the serial killer was a
Caucasian male, and had tested the DNA of over 1,000 Caucasian men without finding a
match. The police then turned to the inventors, who, using the compositions and methods of
the invention, determined that the individual committing the crimes was African American
and, more specifically, had a proportional and confidence-qualified ancestry of 85%
sub-Saharan African and 15% Native American. Based on this result and additional results as
disclosed herein, the police were further advised that the average African American is of 20% 
IndoEuropean ancestry, that greater levels of IndoEuropean ancestry correlate with lighter 
skin tone, and, therefore, that the person committing the crimes was likely an African 
American with average to darker than average skin tone. Within two months of refocusing 
their efforts based on this information, the police arrested an African American man of 
average skin tone (for African Americans); DNA testing determined that he was the person
whose DNA was found at the crime scenes. 

[0011] Accordingly, the present invention relates to a method of inferring, with a 
predetermined level of confidence, a trait of an individual. Such a method can be performed, 
for example, by contacting a sample, which includes nucleic acid molecules of a test 
individual, with hybridizing oligonucleotides, wherein the hybridizing oligonucleotides can detect
nucleotide occurrences of single nucleotide polymorphisms (SNPs) of a panel of at least 
about ten ancestry informative markers (AIMs) indicative of a population structure correlated 
with the trait, and wherein said contacting is performed under conditions suitable for 
detecting the nucleotide occurrences of the AIMs of the individual by the hybridizing 
oligonucleotides; and identifying, with a predetermined level of confidence, a population 
structure that correlates with the nucleotide occurrences of the AIMs in the individual, 
wherein the population structure correlates with a trait. As disclosed herein, a panel of at 
least about ten AIMs (e.g., 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, or more) is examined in
practicing a method of the invention. Generally, the greater the number of AIMs examined, the
greater the confidence level of an inference made using the method. 

[0012] A trait for which an inference is made according to a method of the invention can 
be any trait, including a trait for which an ethnic predisposition is known or suspected to 
occur and a trait for which it is known that no ethnic predisposition occurs or for which it is not
known or is unclear as to whether there is an ethnic predisposition. In one embodiment, the
trait is biogeographical ancestry (BGA). In one aspect, the panel of AIMs used to examine
BGA includes AIMs as set forth in SEQ ID NOS:1 to 71. In another aspect, the panel
includes AIMs as set forth in SEQ ID NOS:7, 21, 23, 27, 45, 54, 59, 63, and 72 to 152; in
SEQ ID NOS:3, 8, 9, 11, 12, 33, 40, 59, 63, and 153 to 239; or in SEQ ID NOS:1, 8, 11, 21,
24, 40, 172, and 240 to 331, as well as a panel containing combinations of AIMs as set forth
in SEQ ID NOS:1 to 331. As disclosed herein, AIMs useful in practicing a method of the
invention can, but need not, be linked to a gene linked to the trait (i.e., a gene known to be 
involved in the trait phenotype) and generally are not in linkage disequilibrium with the gene 
(or locus). For example, an AIM useful for inferring drug responsiveness of an individual 
according to a method of the invention need not be linked to a gene involved in
responsiveness to the drug (e.g., a drug metabolism gene or a drug transport gene such as a 
cytochrome P450 gene or P-glycoprotein gene). Similarly, an AIM useful for inferring a 
pigmentation trait of an individual according to a method of the invention need not be linked 
to a gene involved in pigmentation (e.g., a tyrosinase gene or a melanocortin-1 receptor
gene). Thus, in one aspect, at least one (e.g., 1, 2, 3, 4, or 5) AIM of a panel is not linked to a 
gene involved in the trait for which an inference is being made. 

[0013] Where BGA is the trait for which an inference is being made, an individual being
examined can have an ancestry that includes any one or a combination of ancestral groups, 
including, for example, a proportion of sub-Saharan African ancestry, Native American 
ancestry, IndoEuropean ancestry, East Asian ancestry, Middle Eastern ancestry, Pacific 
Islander ancestry, or a combination including one or more of these ancestries. As such, the 
proportional ancestry of an individual can comprise one ancestry (e.g., 100% IndoEuropean 
ancestry), or any proportion of two, three, four, or more ancestral groups. As such, a test 
individual (or individual of known proportional ancestry) can have, for example, a proportion 
of at least three ancestral groups, which can include proportions of sub-Saharan African 
ancestry and two other ancestries, or can include proportions of sub-Saharan African and 
IndoEuropean ancestral groups and a third ancestry; or Native American and IndoEuropean 
ancestral groups and a third ancestry; or East Asian and Native American ancestral groups 
and a third ancestry; or IndoEuropean and East Asian ancestral groups and a third ancestry; 
or can include proportions of Native American, East Asian, and IndoEuropean ancestral 
groups, or of sub-Saharan African, Native American, and IndoEuropean ancestral groups, and 
the like. 

[0014] In another embodiment, a trait of a test individual for which an inference is being
made is responsiveness of the individual to a drug, particularly a therapeutic drug. As such, a
method of the invention provides a tool for realizing personalized medicine. A drug for 
which an inference can be made as to whether a test individual will be responsive, in either a 
positive or negative manner, can be, for example, a cancer chemotherapeutic agent such as 
paclitaxel, or a drug such as a statin, which can be useful for maintaining or lowering 
cholesterol levels. In one aspect of this embodiment, AIMs of the panel of AIMs used to
practice the method include AIMs of genes other than genes known to be involved in
melanin synthesis or metabolism. 

[0015] In still another embodiment, a trait of a test individual for which an inference is 
being made is a susceptibility or predisposition of the individual to a disease. As disclosed 
herein, various traits are associated with population structure at a continental level, whereas
other traits are associated with population structure at finer levels. As such, a method of the
invention can provide a means for making an inference with respect to a trait such as disease
susceptibility for diseases such as diabetes, hypertension, and cancers that are known to have
an ethnic predisposition (i.e., known to occur with higher frequencies in individuals of certain
ethnic/ancestral groups), as well as for diseases such as alcoholism, schizophrenia,
Parkinson's disease, and other neurological disorders, which do not (or at least are not known 
to) have an ethnic predisposition. 

[0016] In yet another embodiment, a trait of a test individual for which an inference is 
being made is a pigmentation trait. The pigmentation trait can be any such trait including, for 
example, eye color or shade, skin color, hair color, or a combination thereof. In one aspect of 
this embodiment, AIMs of the panel of AIMs used to practice the method include AIMs of
genes other than genes known to be involved in melanin synthesis or metabolism, or other 
aspects of pigmentation. 

[0017] A method of inferring a trait of a test individual by determining a population 
structure that correlates with nucleotide occurrences of AIMs in the individual can further 
include identifying, with a predetermined level of confidence, a sub-population structure of 
the population structure, wherein the sub-population structure correlates with a trait. For 
example, a population structure of an individual can correlate to an intercontinental group 
with which, by inference, the individual shares ancestry, for example, IndoEuropean, and a 
sub-population structure can further correlate with an intracontinental group with which the 
individual shares IndoEuropean ancestry, for example, Mediterranean ethnicity. 

[0018] The hybridizing oligonucleotides useful in the methods of the invention can be
oligonucleotide probes or oligonucleotide primers. Oligonucleotide probes useful in the 
present methods can hybridize to a nucleotide sequence that includes the SNP position for an 
AIM, wherein the nucleotide at the position of the hybridizing oligonucleotide that 
corresponds to the position of the SNP for the AIM either matches or does not match the 
nucleotide occurrence at the SNP position. Additional oligonucleotide probes useful in the 
methods of the invention include oligonucleotide probes that hybridize to a polynucleotide 
sequence adjacent to and upstream and/or adjacent to and downstream of the SNP position, 
and that can, but need not, include a nucleotide corresponding to the nucleotide position of 
the SNP, and wherein such a corresponding nucleotide, when present in the probe, can, but
need not, match the nucleotide occurrence at the SNP.

[0019] Oligonucleotide primers useful in the methods of the invention include 
oligonucleotide primers useful for a primer extension reaction, as well as oligonucleotide 
primers that, in combination, allow for amplification of a template polynucleotide comprising
the AIM. Such amplification primer pairs generally include a forward primer and a reverse
primer useful for amplification of a template polynucleotide comprising an AIM of interest.
It will be recognized, however, that 2, 3, 4, or more different forward primers can be used
with a common reverse primer for amplification of different template polynucleotides 
comprising the AIM (e.g., in a multiplex reaction) and a common gene sequence (e.g., AIMs 
of a family of related gene sequences) or for generating amplification products of different 
sizes from a single template. Similarly, one common forward primer can be used with one or 
a plurality of different reverse primers. 

[0020] Accordingly, in one embodiment, a method of the invention is performed using 
oligonucleotide primers. In one aspect of this embodiment, the method includes contacting 
the sample with the oligonucleotide primers and with a polymerase, under conditions suitable
for generation of a primer extension product. In such a method, the nucleotide occurrence of 
a SNP can be determined by detecting the presence of the primer extension product, or by 
sequencing the primer extension product (or a product thereof) and identifying the nucleotide 
at the position corresponding to the position of the SNP. In another aspect of this
embodiment, the method includes contacting the sample with oligonucleotide primers that
comprise amplification primer pairs and with a polymerase, under conditions suitable for
generation of an amplification product. In such a method, the nucleotide occurrence of a
SNP can be determined by detecting the presence of the amplification product, or by
sequencing the amplification product (or a product thereof) and identifying the nucleotide at
the position corresponding to the position of the SNP.

WO 2004/016768
PCT/US2003/026229

[0021] The methods of the invention are particularly adaptable to being performed in a
high throughput format, including in a multiplex format, thus allowing examination of a large 
number of AIMs and/or a large number of samples of test individuals, as well as controls, in 
parallel. As such, the methods can be performed using a format in which the samples being 
examined are arranged in an array, particularly an addressable array, e.g., in wells in a tray
or on a glass slide or silicon chip, and can be partly or fully automated using robotics. Where 
a multiplex platform is used, it will be recognized that the AIMs examined need not 
necessarily be those having the greatest delta values for the particular trait, but also can be 
selected to balance the delta value with the compatibility of primers in a multiplex set, for 
example, to select AIMs such that hybridizing oligonucleotides (e.g., amplification primer 
pairs) can be designed that can be used in a single reaction for examining a panel of AIMs but
that do not substantially cross-hybridize with AIMs other than the target AIM for which the 
hybridizing oligonucleotides are designed. 
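
One way such a balance between delta value and multiplex compatibility might be struck is sketched below as a simple greedy selection. This is an assumption-laden illustration rather than the selection procedure of the invention: the marker names, the representation of cross-hybridizing primer sets as a set of conflicting marker pairs, and the function name are all hypothetical.

```python
def select_multiplex_panel(deltas, conflicts, panel_size):
    """Greedy choice of markers for one multiplex reaction.

    deltas: dict mapping a (hypothetical) marker name to its delta value.
    conflicts: set of frozenset({a, b}) pairs whose oligonucleotides are
        assumed to cross-hybridize and so cannot share a reaction.
    panel_size: maximum number of markers to return.
    """
    chosen = []
    # Consider markers from the largest delta value downwards.
    for name in sorted(deltas, key=deltas.get, reverse=True):
        if len(chosen) >= panel_size:
            break
        # Keep a marker only if it is compatible with every marker already chosen.
        if all(frozenset((name, other)) not in conflicts for other in chosen):
            chosen.append(name)
    return chosen
```

In this sketch a marker with a high delta value is passed over when its oligonucleotides conflict with a marker already in the panel, so the resulting panel need not contain the markers having the greatest delta values overall.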

[0022] The present invention also relates to a method of estimating, with a predetermined 
level of confidence, proportional ancestry of at least two ancestral groups of a test individual. 
Such a method can be performed, for example, by contacting a sample, which includes 
nucleic acid molecules of the test individual, with hybridizing oligonucleotides that can 
detect nucleotide occurrences of SNPs of a panel of at least about ten AIMs that are 
indicative of BGA for each ancestral group examined, wherein the contacting is under 
conditions suitable for detecting the nucleotide occurrences of the AIMs of the test individual 
by the hybridizing oligonucleotides; and identifying, with a predetermined level of 
confidence, a population structure that correlates with, or is most likely given, the nucleotide 
occurrences of the AIMs of each of the ancestral groups examined, wherein the population
structure is indicative of proportional ancestry. 

[0023] The proportional ancestry estimated according to a method of the invention can be 
a proportion of any ancestral group, including, for example, a proportion of sub-Saharan 
African, Native American, IndoEuropean, East Asian, Middle Eastern, or Pacific Islander 
ancestral group, and generally is a combination of two or more of such ancestral groups.
Thus, the proportional ancestry of a test individual can include proportions of sub-Saharan
African and IndoEuropean ancestral groups (e.g., 80% sub-Saharan African and 20% 
IndoEuropean; or 60% sub-Saharan African, 20% IndoEuropean, and 20% of a third ancestral 
group), or can include proportions of Native American and IndoEuropean ancestral groups; 
East Asian and Native American ancestral groups; IndoEuropean and East Asian ancestral 
groups; and the like. Similarly, the proportional ancestry can include proportions of Native 
American, East Asian, and IndoEuropean ancestral groups; sub-Saharan African, Native
American, and IndoEuropean ancestral groups; sub-Saharan African, Native American, and 
East Asian ancestral groups; and the like. 

[0024] A panel of AIMs useful for estimating proportional ancestry of an individual can 
include AIMs as set forth in SEQ ID NOS:1 to 331, for example, AIMs as set forth in SEQ
ID NOS:1 to 71, which can be useful for determining proportional ancestries including
IndoEuropean, sub-Saharan African, East Asian, and Native American; or AIMs as set forth
in SEQ ID NOS:7, 21, 23, 27, 45, 54, 59, 63, and 72 to 152, which can be useful for
determining proportional ancestry of East Asians and sub-Saharan Africans; or in SEQ ID
NOS:3, 8, 9, 11, 12, 33, 40, 59, 63, and 153 to 239, which can be useful for determining
proportional ancestry of East Asians and IndoEuropeans; or in SEQ ID NOS:1, 8, 11, 21, 24,
40, 172, and 240 to 331, which can be useful for determining proportional ancestry of
IndoEuropeans and sub-Saharan Africans. 

[0025] In one embodiment, an estimate is made wherein the proportional ancestry 
includes proportions of three ancestral groups. In one aspect of this embodiment, identifying 
a population structure that correlates with, or is most likely given, the nucleotide occurrences 
of the AIMs of the test individual is practiced by performing a likelihood determination for 
affiliation with each of a sub-Saharan African ancestral group, a Native American ancestral
group, an IndoEuropean ancestral group, and an East Asian ancestral group; thereafter 
selecting three ancestral groups having a greatest likelihood value; determining a likelihood 
of all possible proportional affiliations among the three ancestral groups having the greatest 
likelihood value, whereby a population structure or proportional affiliation that correlates 
with the nucleotide occurrences of the AIMs of the test individual is identified; and
identifying a single proportional combination of maximum likelihood. 
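
The steps of this aspect (a likelihood for each ancestral group, selection of the three best-supported groups, and a search over all proportional affiliations among them) can be sketched as follows. The sketch rests on assumptions that are not stated in the text: the expected allele frequency for an admixed individual is modeled as the proportion-weighted mixture of the group frequencies, genotypes follow Hardy-Weinberg proportions at independent loci, and "all possible proportional affiliations" is approximated by a grid whose resolution is set by the hypothetical parameter `steps`.

```python
import math


def log_lik(genotypes, freqs):
    """Log-likelihood of allele counts (0, 1, 2) given per-locus frequencies."""
    total = 0.0
    for count, p in zip(genotypes, freqs):
        prob = {0: (1.0 - p) ** 2, 1: 2.0 * p * (1.0 - p), 2: p * p}[count]
        if prob == 0.0:
            return float("-inf")
        total += math.log(prob)
    return total


def estimate_three_way(genotypes, group_freqs, steps=20):
    """Return (proportions by group, log-likelihood) at the grid maximum.

    group_freqs: dict mapping group name -> per-locus allele frequencies.
    """
    # Step 1: likelihood of affiliation with each single ancestral group.
    single = {name: log_lik(genotypes, f) for name, f in group_freqs.items()}
    # Step 2: keep the three groups with the greatest likelihood.
    top3 = sorted(single, key=single.get, reverse=True)[:3]
    # Step 3: scan every proportional combination on the grid.
    best_props, best_ll = None, float("-inf")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            props = (i / steps, j / steps, (steps - i - j) / steps)
            mixed = [
                sum(m * group_freqs[name][locus] for m, name in zip(props, top3))
                for locus in range(len(genotypes))
            ]
            ll = log_lik(genotypes, mixed)
            # Step 4: retain the single combination of maximum likelihood.
            if best_props is None or ll > best_ll:
                best_props, best_ll = props, ll
    return dict(zip(top3, best_props)), best_ll
```

A finer grid (larger `steps`) gives a more precise estimate at the cost of more likelihood evaluations; the number of grid points grows roughly with the square of `steps`.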

[0026] In another aspect of this embodiment, identifying a population structure that 
correlates with, or is most likely given, the nucleotide occurrences of the AIMs is practiced 
by performing six two-way comparisons comprising likelihood determinations for affiliation 
between each group with each other group; thereafter selecting three ancestral groups having 
a greatest likelihood value; determining a likelihood of all possible proportional affiliations 
among the three ancestral groups having the greatest likelihood value, whereby a population 
structure or proportional affiliation that correlates with, or is most likely given, the nucleotide 
occurrences of the AIMs of the test individual is identified; and identifying a single
proportional combination of maximum likelihood. 

[0027] In still another aspect of the embodiment wherein an estimate is made wherein the 
proportional ancestry includes proportions of three ancestral groups, the method is practiced 
by performing three three-way comparisons among the groups; determining a likelihood of 
all possible proportional affiliations among the three ancestral groups having the greatest 
likelihood value, whereby a population structure or proportional affiliation that correlates 
with, or is most likely given, the nucleotide occurrences of the AIMs of the test individual is 
identified; and identifying a single proportional combination of maximum likelihood. In 
another aspect of this embodiment, the method can further include generating a graphical 
representation of the comparison of the three ancestral groups, wherein the graphical 
representation comprises a triangle with each ancestral group independently represented by a 
vertex of the triangle, and wherein the maximum likelihood value of proportional affiliation 
for an individual comprises a point within the triangle. If desired, the graphical
representation can further include a confidence contour that indicates a level of confidence
associated with estimating the proportional ancestry. 
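
As a minimal sketch of how such a triangle representation might be computed, the following Python function maps three ancestry proportions to a point inside a triangle using barycentric coordinates. The equilateral layout with unit sides, the assignment of groups to vertices, and the names `VERTICES` and `triangle_point` are assumptions made for illustration and are not taken from this specification.

```python
import math

# Assumed layout: an equilateral triangle with unit sides, one group per vertex.
VERTICES = {
    "AFR": (0.0, 0.0),
    "EUR": (1.0, 0.0),
    "NAM": (0.5, math.sqrt(3) / 2),
}


def triangle_point(proportions):
    """Return the (x, y) plotting position for a dict of three ancestry proportions."""
    total = sum(proportions[g] for g in VERTICES)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    # Each vertex is weighted by the (normalized) proportion of its group.
    x = sum(proportions[g] / total * VERTICES[g][0] for g in VERTICES)
    y = sum(proportions[g] / total * VERTICES[g][1] for g in VERTICES)
    return x, y
```

With this layout, the height of the point above the leg opposite a vertex, expressed as a fraction of the triangle's height, equals the proportion of the group assigned to that vertex.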

[0028] In another embodiment, an estimate is made wherein the proportional ancestry 
includes proportions of four ancestral groups. In various aspects of this embodiment, 
identifying a population structure that correlates with, or is most likely given, the nucleotide 
occurrences of the AIMs of the test individual is practiced by performing six two-way 
comparisons, or by performing three three-way comparisons, or by performing one four-way
comparison among the groups; determining a likelihood of all possible proportional 
affiliations among the four ancestral groups having the greatest likelihood value, whereby a 
population structure or proportional affiliation that correlates with, or is most likely given, the 
nucleotide occurrences of the AIMs of the test individual is identified; and identifying a 
single proportional combination of maximum likelihood. In one aspect of this embodiment, 
the method can further include generating a graphical representation of the comparison of the
four ancestral groups, wherein the graphical representation comprises a pyramid with each
ancestral group independently represented by a vertex of the pyramid, and wherein the 
maximum likelihood value of proportional affiliation for an individual comprises a point 
within the pyramid. If desired, the graphical representation can further include a confidence 
contour comprising a sphere around the point, wherein the sphere indicates a level of 
confidence associated with estimating the proportional ancestry. 

[0029] The method of estimating, with a predetermined level of confidence, proportional
ancestry of at least two ancestral groups of a test individual by identifying a population
structure indicative of the proportional ancestry can further include identifying a 
sub-population structure indicative of ethnicity associated with one of the ancestral groups for 
which the test individual has a proportional ancestry. According to this method, a 
sub-population structure of the population structure that correlates with the nucleotide 
occurrences of the AIMs in the test individual is identified, wherein the sub-population 
structure correlates with ethnicity of the test individual. Such a method of identifying a 
sub-population structure can be performed, for example, by identifying those chromosomes 
of the test individual that contain the AIMs indicative of affiliation with a BioGeographical
ancestral group (where the individual is proportionally affiliated with more than one
BioGeographical Ancestry group), contacting a sample including nucleic acid molecules of 
the test individual with second hybridizing oligonucleotides that can detect nucleotide 
occurrences of SNPs of a second panel of AIMs, wherein the AIMs of the second panel are 
informative for ethnicity within one of these groups and are present on the same 
chromosomes of the test individual that contain the AIMs indicative of the larger 
(intercontinental) ancestral group within which the ethnicity occurs; and identifying a 
sub-population structure that correlates with the nucleotide occurrences of the AIMs of the 
second panel, wherein the sub-population is indicative of ethnicity of the ancestral group of 
the test individual. 

[0030] According to such a method, using hybridizing oligonucleotides specific for the 
first panel of AIMs (e.g., AIMs of the 71 exemplified AIMs; SEQ ID NOS: 1 to 71), a test 
individual can be determined to be 60% IndoEuropean (IE) and 40% East Asian. In such a 
case, only a fraction of the total possible AIMs that can be indicative of the IE ancestral 
group will have been positive (if all were positive, the individual would have been 100% IE) 
and, therefore, only some of the individual's chromosomes or chromosomal regions will be of
IndoEuropean origin. The chromosomes of the individual containing the positive AIMs for 
IE are then identified, and second hybridizing oligonucleotides specific for a second panel of 
AIMs are selected (e.g., from a group of 1000 or so AIMs that cover all 23 pairs of human
chromosomes), wherein the AIMs of the second panel are limited to those that are highly 
variable in allele frequencies between IE ethnic groups and, therefore, indicative of IE 
ethnicity, and also are present on the chromosomes for which the first panel AIMs were IE 
positive. A sub-population structure that correlates with the nucleotide occurrences of the 
AIMs of the second panel is then identified, thus indicating an ethnicity with respect to the IE 
ancestral group of the test individual, for example, that the IE ancestral group derives from a 
Northern European, a Mediterranean, a Middle Eastern, or a South Asian Indian ethnicity.
As such, the method provides a means to identify the ethnic origin of particular chromosomes
(e.g., a Mediterranean origin of chromosomes previously determined to be of IndoEuropean
origin) that contain AIMs that correlate with a population structure indicative of
IndoEuropean BioGeographical Ancestry, and further contain AIMs that correlate more
specifically with a sub-population structure indicative of Mediterranean ethnicity. 
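
The selection of the second panel described above can be sketched as a simple filter. This is an illustrative Python sketch only: the record types `FirstPanelCall` and `CandidateAim`, their fields, the `ethnic_delta` score, and the threshold are all hypothetical and are not part of this specification.

```python
from typing import List, NamedTuple, Optional


class FirstPanelCall(NamedTuple):
    aim_id: str
    chromosome: int
    positive_for: Optional[str]  # ancestral group the AIM was positive for, if any


class CandidateAim(NamedTuple):
    aim_id: str
    chromosome: int
    ethnic_delta: float  # assumed score: allele-frequency spread among ethnic groups


def select_second_panel(
    first_panel: List[FirstPanelCall],
    candidates: List[CandidateAim],
    group: str,
    min_delta: float,
) -> List[CandidateAim]:
    """Keep candidate AIMs that vary among ethnic groups and lie on chromosomes
    where a first-panel AIM was positive for the given ancestral group."""
    chromosomes = {c.chromosome for c in first_panel if c.positive_for == group}
    return [
        a for a in candidates
        if a.chromosome in chromosomes and a.ethnic_delta >= min_delta
    ]
```

Filtering by whole chromosome is a simplification; a practical implementation might instead restrict the second panel to chromosomal regions near the positive first-panel AIMs.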

[0031] In another embodiment, the method of estimating proportional ancestry of a test
individual can include generating an ancestral map of the world, wherein locations of
populations having a proportional ancestry corresponding to the proportional ancestry of the 
test individual are indicated on the ancestral map. As such, the method can supplement 
genealogical information. For example, the method can further include overlaying the 
ancestral map with a genealogical map, wherein the genealogical map indicates locations of 
populations having geopolitical relevance with respect to the test individual, and statistically 
combining the information of the ancestral map and genealogical map to obtain a most likely 
estimate of family history of the test individual. 

[0032] Identifying a population structure that correlates with, or is most likely given, the 
nucleotide occurrences of the AIMs, according to a method of the invention, can be 
performed by comparing the nucleotide occurrences of the AIMs of the test individual with 
known proportional ancestries corresponding to nucleotide occurrences of AIMs indicative of 
BGA. The known proportional ancestries corresponding to nucleotide occurrences of AIMs 
indicative of BGA can be contained in a table or other list, and the nucleotide occurrences of 
the test individual can be compared to the table or list visually, or can be contained in a 
database, and the comparison can be made electronically, for example, using a computer. 
Further, each of the known proportional ancestries corresponding to nucleotide occurrences 
of AIMs indicative of BGA can be associated with a photograph of a person from whom the 
known proportional ancestry was determined, thus providing a means to further infer physical 
characteristics of a test individual. In one aspect, the photograph is a digital photograph, 
which comprises digital information that can be contained in a database that can further 
contain a plurality of such digital information of digital photographs, each of which is 
associated with a known proportional ancestry corresponding to nucleotide occurrences of 
AIMs indicative of BGA of the person in the photographs. 

[0033] In another aspect, a method of the invention can further include identifying a 
photograph of a person having a proportional ancestry corresponding to the proportional
ancestry of the test individual. Such identifying can be done by manually looking through
one or more files of photographs, wherein the photographs are organized, for example, 
according to the nucleotide occurrences of AIMs of the person in the photograph. Identifying 
the photograph also can be performed by scanning a database comprising a plurality of files, 
each file containing digital information corresponding to a digital photograph of a person 
having a known proportional ancestry, and identifying at least one photograph of a person 
having nucleotide occurrences of AIMs indicative of BGA that correspond to the nucleotide 
occurrences of AIMs indicative of BGA of the test individual. 

[0034] Accordingly, the present invention also relates to an article of manufacture, which
is at least one photograph of a person having a known proportional ancestry corresponding to
a population structure comprising nucleotide occurrences of AIMs indicative of BGA, as well 
as to a plurality of such articles, each article of the plurality comprising one (or more) 
photograph(s) of a person having a known proportional ancestry corresponding to a 
population structure comprising nucleotide occurrences of AIMs indicative of BGA. The 
article can be contained in a file, or a plurality of the articles can be contained in a file, for
example, a file containing a plurality of photographs of different persons, wherein some
or all of the persons have the same or different known proportional ancestries that correspond 
to a population structure comprising nucleotide occurrences of AIMs indicative of BGA. 

[0035] Accordingly, a plurality of such articles is provided, as is a plurality of files, each
file of which can contain one or more articles, i.e., photographs, which can be of one or more
persons having the same or different known proportional ancestries that correspond to a 
population structure comprising nucleotide occurrences of AIMs indicative of BGA. For 
example, different files of the plurality each can contain one (or more) photograph(s) of one 
person having a known proportional ancestry corresponding to a population structure 
comprising nucleotide occurrences of AIMs indicative of BGA. Different files of the 
plurality also can contain photographs of two or more different persons, each of whom has 
the same or substantially the same proportional ancestry corresponding to a population 
structure comprising nucleotide occurrences of AIMs indicative of BGA. As such, a plurality 
of files can contain files, each of which contains one or more photographs of one or more
persons, and when containing one or more photographs of two or more different persons, the
different persons can have the same or different known proportional ancestries. 

[0036] In one embodiment, the article of manufacture, i.e., the photograph of a person 
having a known proportional ancestry corresponding to a population structure comprising 
nucleotide occurrences of AIMs indicative of BGA, is a digital photograph, which comprises 
digital information. As such, the digital information of the digital photograph, or of a 
plurality of digital photograph articles of manufacture of the invention can be contained in a 
database. As such, the present invention further provides a plurality of the articles of
manufacture, including at least two digital photographs each of which comprises digital
information. In one aspect of this embodiment, the digital information for one or a plurality 
of the articles is contained in a database, which can be contained in any medium suitable for 
containing such a database, including, for example, computer hardware or software, a 
magnetic tape, or a computer disc such as floppy disc, CD, or DVD. As such, the database 
can be accessed through a computer, which can contain the database therein, can accept a 
medium containing the database, or can access the database through a wired or wireless 
network, e.g., an intranet or internet. 

[0037] The present invention also relates to a kit, which contains a plurality of hybridizing
oligonucleotides, each hybridizing oligonucleotide including at least fifteen contiguous
nucleotides of a polynucleotide as set forth in SEQ ID NOS:1 to 331, or a polynucleotide
complementary thereto, and the plurality including at least five of such oligonucleotides, each
based on different polynucleotides as set forth in SEQ ID NOS:1 to 331. In one embodiment,
the hybridizing oligonucleotides include at least fifteen contiguous nucleotides of at least
five polynucleotides as set forth in SEQ ID NOS:1 to 71, or polynucleotides complementary
to any of SEQ ID NOS:1 to 71.

[0038] The hybridizing polynucleotides of a kit of the invention can include probes, which 
are useful for detecting a particular AIM, including a particular nucleotide occurrence at the 
SNP position or DIP (deletion/insertion polymorphism) position of the AIM; can include 
primers, including primers useful for a primer extension reaction and primer pairs useful for a 
nucleic acid amplification reaction; or can include combinations of such probes and primers.
In one embodiment, a hybridizing oligonucleotide of the plurality includes a nucleotide
corresponding to the nucleotide position of the AIM (e.g., nucleotide 50 of any of SEQ ID
NOS:1 to 34 and most others, nucleotide 56 of SEQ ID NO:35, nucleotide 44 of SEQ ID
NO:50, or nucleotide 26 of SEQ ID NO:56), or to a nucleotide sequence complementary 
thereto, such a hybridizing oligonucleotide being useful as a probe to identify the presence or 
absence of a particular nucleotide occurrence at the SNP position of the AIM. 

[0039] In another embodiment, the kit contains at least one pair of hybridizing 
oligonucleotides useful for detecting the nucleotide occurrence(s) at the SNP (or DIP) 
position of an AIM. In one aspect of this embodiment, a pair of hybridizing oligonucleotides 
includes one oligonucleotide that hybridizes upstream of and adjacent to the SNP position of an
AIM and a second oligonucleotide that hybridizes downstream of and adjacent to the SNP (or 
DIP) position of the AIM, wherein one or the other of the pair further contains a nucleotide 
complementary to a nucleotide occurrence suspected of being at the SNP (or DIP) position of 
the AIM (i.e., one of the polymorphic nucleotides), such a pair of hybridizing 
oligonucleotides being useful in an oligonucleotide ligation assay. In another aspect of this 
embodiment, a pair of hybridizing oligonucleotides includes an amplification primer pair, 
including a forward primer and a reverse primer, such a pair of hybridizing oligonucleotides 
being useful for amplifying a portion of polynucleotide that includes the SNP (or DIP) 
position of the AIM. 

[0040] A kit of the invention can further contain additional reagents useful for practicing a 
method of the invention. As such, the kit can contain one or more polynucleotides 
comprising an AIM, including, for example, a polynucleotide containing an AIM for which a 
hybridizing oligonucleotide or pair of hybridizing oligonucleotides of the kit is designed to 
detect, such polynucleotide(s) being useful as controls. Further, hybridizing oligonucleotides 
of the kit can be detectably labeled, or the kit can contain reagents useful for detectably 
labeling one or more of the hybridizing oligonucleotides of the kit, including different 
detectable labels that can be used to differentially label the hybridizing oligonucleotides; such 
a kit can further include reagents for linking the label to hybridizing oligonucleotides, or for 
detecting the labeled oligonucleotide, or the like. A kit of the invention also can contain, for
example, a polymerase, particularly where hybridizing oligonucleotides of the kit include
primers or amplification primer pairs; or a ligase, where the kit contains hybridizing
oligonucleotides useful for an oligonucleotide ligation assay. In addition, the kit can contain 
appropriate buffers, deoxyribonucleotide triphosphates, etc., depending, for example, on the 
particular hybridizing oligonucleotides contained in the kit and the purpose for which the kit 
is being provided. 

BRIEF DESCRIPTION OF THE DRAWINGS

[0041] Figure 1 provides a diagram indicating the fashion in which chromosomal
segments are shuffled by recombination over time in an admixed population. Initially, the
parental populations have chromosomal segments that are continuous with respect to AIMs
along the segment. In the first filial (F1) generation all persons have one complete
chromosomal segment from each parental population. In the F2 generation, many more
combinations are possible. The relative likelihood of the non-recombinant vs. the
recombinant genotypes shown in F2 is dependent on the size of the chromosomal segment.
Segments of the order of the size of human chromosomes will average several recombination
events in a single meiosis (one recombination is equally likely every 50 cM of genetic
distance). F3 shows an example of a likely genotype for a person with two parents from the
F2 generation. F(N) x F1 diagrams a genotype of a person with one F(N) parent and one F1
parent; and F(N) x F2 diagrams a genotype of a person with one F(N) parent and one F2
parent. 

[0042] Figures 2A and 2B show triangle graphs generated using the algorithm described
in Example 6 (see, also, Table 12). NAM, Native American; AFR, sub-Saharan African; 
EUR, IndoEuropean. 

[0043] Figure 2A illustrates extension of a line from the NAM vertex to the opposite leg
of the triangle, wherein the opposite leg represents 0% Native American ancestry. A circle is
shown at the position of the estimated proportional ancestry (see Figure 2B), with the hatch
mark on the line indicating the percent of Native American ancestry (approximately 15%).

[0044] Figure 2B shows additional lines drawn from the AFR and EUR vertices. The
position on each line corresponding to the position of the circle represents the proportion of 
each respective ancestry; i.e., 15% Native American, 60% IndoEuropean, and 25% African. 

[0045] Figure 3 shows a triangle plot depicting one approach to illustrate the value and 
precision of individual ancestry estimates. Typical distributions of three populations are 
shown (European Americans: filled squares; African-Americans: open triangles; and an 
African/Native American population: open circles). Also shown is a single individual with 
likelihood intervals represented as concentric rings surrounding the point estimate (filled 
circle). Like a topological map, each concentric ring represents a decrease in the likelihood 
by 1 log unit (10 times less likely). In this example, the individual has a likelihood interval 
space that is symmetrical and circular. Interval spaces will take many shapes depending on 
the admixture proportions of the subject in question and the allele frequencies of the markers 
that have been typed. 

[0046] Figure 4 provides a triangular plot showing average admixture estimates for three 
African-American samples (filled circles: WASH-Washington DC, AFCAR-AfroCaribbeans, 
and BOG-Bogalusa), a European-American sample (open circle: SCO-State College), and a
Spanish-American sample (open diamond: SLV-San Luis Valley CO). In parentheses is
shown the average African (AFR), IndoEuropean (EUR), and Native American (NAM) 
genetic contribution to each sample. 

[0047] Figures 5A and 5B show the genetic structure in US resident populations. 

[0048] Figure 5A shows the percentage of unlinked AIMs showing significant association. 
Expected values are based on a 5% significance level. Values for the Washington DC sample 
are based on 33 AIMs, for San Luis Valley CO on 19 AIMs, and for State College PA on 34 
AIMs. 

[0049] Figure 5B shows the correlation between individual ancestry estimates based on 
independent subsets of informative markers. Average correlation is based on 100 replicates.
The total number of markers is the same as for Figure 5A. The corresponding p values are
indicated at the bottom of the graph. 

[0050] Figures 6A and 6B show the triangle plots for a father (Figure 6A) and mother 
(Figure 6B). 

[0051] Figures 7A to 7C show the triangle plots for each of three children of the father
and mother represented in Figure 6.

[0052] Figure 8 shows the distribution of AIMs in the genome (chrom. number, 
chromosome number). 

[0053] Figures 9A and 9B demonstrate the robustness of BGA admixture proportion
analysis using AIMs (see Example 2). The confidence (contour lines) of the maximum
likelihood estimate (MLE; point) is predictably affected by the elimination of AIMs 
informative for a particular pair-wise comparison. The first contour line extending from the 
MLE defines the triangle plot space within which the likelihood is 2 times lower than that of 
the MLE, and the second contour line defines the space in which the likelihood is 5 times 
lower than the MLE. 
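
As an illustrative sketch of how such contour regions might be derived, the following Python function selects the grid points whose likelihood is no more than a given factor lower than the maximum. It assumes that natural-log likelihoods have already been evaluated on a grid of candidate proportions; the function name `within_fold` and the dictionary layout are hypothetical and are not taken from this specification.

```python
import math


def within_fold(log_likelihoods, fold):
    """Return the grid points whose likelihood is at most `fold` times lower
    than the maximum likelihood.

    log_likelihoods: dict mapping a grid point (any hashable) -> natural-log likelihood
    fold:            e.g., 2 for the first contour, 5 for the second
    """
    peak = max(log_likelihoods.values())
    # A likelihood ratio of `fold` is a difference of ln(fold) on the log scale.
    cutoff = peak - math.log(fold)
    return {point for point, ll in log_likelihoods.items() if ll >= cutoff}
```

The boundary of the returned set approximates the corresponding contour line; a finer grid gives a smoother contour.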

[0054] Figure 9A shows the MLE and confidence contours obtained using 71 AIMs;
actual percentages are indicated.

[0055] Figure 9B shows the results obtained after eliminating from the analysis those of the
AIMs used to obtain the results shown in Figure 9A that are informative for the East Asian-Native
American distinction. The MLE is relatively unaffected, and the confidence contours along
the East Asian-IndoEuropean (European) and Native American-European axes remain
undistorted, but the confidence contours are distorted along the East Asian-Native American 
axis. 

[0056] Figure 10 shows the BGA admixture proportions determined for each of eight
individuals of a family pedigree. Circles represent females, squares represent males, and the BGA
affiliation for each individual is shown as a fraction where the numerator represents
IndoEuropean BGA and the denominator represents Native American BGA. None of the
individuals harbored sub-Saharan African or East Asian BGA except as indicated by the
asterisk (*), which indicates that the individual was determined to be of 4% East Asian BGA.

[0057] Figure 11 shows a family tree demonstrating how a Chinese great grandparent in
an otherwise IndoEuropean family tree can produce a grandchild with IndoEuropean/East
Asian ancestry. The individuals that are 100% East Asian (Chinese) are shown with shading; 
the admixture results for the male (square) at the bottom of the pedigree (short arrow) are of 
interest. The grandparent indicated by the long arrow is about a 50%/50% East 
Asian/IndoEuropean mix, and her daughter, the subject's mother, is expected to be a 
25%/75% East Asian/IndoEuropean mix (see Example 3). 
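
The expected proportions in such a pedigree can be illustrated with a short Python sketch. It assumes, as a simplification not stated in this specification, that a child's expected proportion for each ancestral group is the average of the two parents' proportions; actual proportions vary about this expectation because of recombination and segregation. The function name and group labels are hypothetical.

```python
def expected_child(parent_a, parent_b):
    """Expected ancestry proportions of a child, taken as the parental average."""
    groups = set(parent_a) | set(parent_b)
    return {g: (parent_a.get(g, 0.0) + parent_b.get(g, 0.0)) / 2 for g in groups}


# Hypothetical pedigree: a 50%/50% grandparent with a 100% IndoEuropean spouse.
grandparent = {"EAS": 0.5, "IE": 0.5}
indo_european = {"IE": 1.0}
mother = expected_child(grandparent, indo_european)   # 25% EAS / 75% IE
subject = expected_child(mother, indo_european)       # assumes a 100% IE father
```

Under the stated assumption of a 100% IndoEuropean father, the subject's expected East Asian proportion is half that of the mother.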

[0058] Figure 12 shows the distribution of all SNPs available for genotyping by 
chromosomal arm for a group of patients treated for elevated cholesterol levels. 

[0059] Figure 13 shows the distribution of SNPs in Caucasian individuals taking Lipitor™ 
(n = 180) for whom response was known in terms of cholesterol (lip TC), low density 
lipoprotein (lip LDL), liver transaminase AST-SGOT (lip SGOT) and ALT-GPT (lip GPT) 
measurements. SNPs with delta values of significance (>0.20) among the various trait classes 
were selected. For example, in about 70% of patients, Lipitor™ causes a decrease in LDL. 
For any given SNP, the delta value (δ) is the difference in minor allele frequency among
those individuals for whom LDL decreased by at least 20% versus those for whom LDL did 
not change. 
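
A minimal Python sketch of this delta value calculation follows. It assumes genotypes coded as the number of copies (0, 1, or 2) of the minor allele per individual and takes the delta value as an absolute difference; the function names, the SNP identifiers, and the use of an absolute value are assumptions made for illustration.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele among diploid individuals."""
    genotypes = list(genotypes)
    return sum(genotypes) / (2 * len(genotypes))


def delta_value(group_a, group_b):
    """Absolute difference in allele frequency between two trait classes."""
    return abs(allele_frequency(group_a) - allele_frequency(group_b))


def select_snps(genotypes_by_snp, threshold=0.20):
    """Return the SNPs whose delta value exceeds the threshold.

    genotypes_by_snp: dict mapping SNP id -> (genotypes of class A, genotypes of class B)
    """
    return [
        snp for snp, (a, b) in genotypes_by_snp.items()
        if delta_value(a, b) > threshold
    ]


# Hypothetical data: responders vs. non-responders for two SNPs.
example = {
    "snp_a": ([0, 0, 1, 0], [2, 1, 1, 2]),
    "snp_b": ([0, 1, 0, 1], [1, 0, 0, 1]),
}
selected = select_snps(example)
```

In this made-up example only `snp_a` shows a frequency difference between the two classes, so it alone passes the 0.20 threshold.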

[0060] Figure 14 shows a similar analysis as for Figure 13, except that response is 
measured following treatment with Zocor™ (n =150), and only total cholesterol (zoc TC) 
and LDL (zoc LDL) were examined. 

[0061] Figure 15 shows a distribution of SNPs (δ > 0.11) among chromosomes for 1,000
individuals of known eye color.

[0062] Figure 16 shows a distribution of SNPs (δ > 0.11) among chromosomes for 1,000
individuals of known hair color.

DETAILED DESCRIPTION OF THE INVENTION

[0063] The present invention is based on the identification of ancestry informative 
markers (AIMs) useful for inferring a level of population structure of an individual, which, in 
turn, allows an inference as to various traits of the individual. Further, the AIMs of the 
present invention are demonstrated to correlate with a trait, regardless of whether the marker 
is in linkage disequilibrium with a gene or locus known to be involved in the trait. As such, 
the AIMs of the present invention are distinguishable from previously described markers, 
which only were considered useful if they were linked with a trait, i.e., if the marker was 
physically close to a gene known to be involved in the trait as characterized, for example, in 
having a low cross-over percentage with respect to a gene (or locus) known to be involved in
(or associated with) the trait. In contrast, there is no requirement that the markers (AIMs) 
useful in the present methods be in linkage disequilibrium with a gene/trait and, in fact, AIMs 
that are disclosed herein as correlating with a trait can be located on different chromosomes 
from each other and from a gene/locus known to be associated with the trait. 

[0064] AIMs are genetic loci that show alleles with high frequency differences between 
populations. AIMs are exemplified herein generally by single nucleotide polymorphisms 
(SNPs; see, e.g., SEQ ID NO:1), as well as by deletion/insertion polymorphisms (DIPs; see,
e.g., SEQ ID NO:363). As disclosed herein, AIMs can be used to estimate BioGeographical
Ancestry (BGA) of an individual or collection of individuals at the population level (in terms 
of races), at the sub-population level (in terms of ethnicities), and at the micro-group level (in 
terms of familial lines within ethnic groups), as well as at a practical, phenotypically qualified 
level (e.g., cases and controls). Such ancestry estimates at the subgroup and individual level 
can be directly instructive regarding the genetics of phenotypes that are different qualitatively 
or in frequency between populations, including, for example, the likelihood that an individual 
will respond to a particular medication or the propensity of an individual to develop a disease. 
Ancestry estimates also can provide a compelling foundation for the use of Admixture 
Mapping (AM) methods to identify the genes underlying these traits. 

[0065] As exemplified herein, a panel of 71 AIMs (SEQ ID NOS:1 to 71) was identified
from an examination of over 800 candidate AIMs (see, also, SEQ ID NOS:72 to 331), and
methods were developed to examine these AIMs as a means to obtain accurate estimates of
proportional ancestry. The methods and markers of the invention have been validated in
studies using skin pigmentation as a model phenotype (see, also, Intl. Publ.
No. WO 02/097047 (PCT/US02/16789), which is incorporated herein by reference). Initial
markers were genotyped in two population samples with primarily African ancestry, African
Americans from Washington D.C. and an African Caribbean sample from England, and in a
sample of European Americans from Pennsylvania (see Example 1). In the two African
population samples, very strong correlations were observed between estimates of individual
ancestry and skin pigmentation as measured by reflectometry (R² = 0.21, p < 0.0001 for the
African-American sample and R² = 0.16, p < 0.0001 for the British African-Caribbean
sample). These correlations confirmed the validity of the ancestry estimates and also 
indicated the high level of population structure related to admixture, which characterizes 
these populations and is detectable using other tests to identify genetic structure. These 
results demonstrate that an estimate of an individual's ancestry can be made based on a DNA 
analysis using a relatively small number of well defined genetic markers (AIMs). 

[0066] The methods and genetic markers disclosed herein provide tools for several distinct 
purposes, including, for example, 1) for the estimation of ancestry proportions in individuals 
from their DNA; 2) for the estimation of genetic structure for the control of study designs 
commonly used for genetic research; 3) for the construction of physical profiles through the 
inference of characteristics related to ancestry, which may have implications in forensic 
investigations; 4) for the identification of disease predisposition, referred to as "Mapping by 
Ancestry Linkage Disequilibrium" (MALD); and 5) for predicting a significant portion of an 
individual patient's response to prescription and over-the-counter medications. As such, the 
present invention provides, for example, 1) statistical methods for the determination of 
ancestral proportions from genetic sequences within individuals and examples of use;
2) several hundred AIMs culled from the publicly available single nucleotide polymorphism
(SNP) database and identified using statistical methods as useful for the determination of 
ancestral proportions within individuals or study groups; 3) several hundred AIMs that are 
demonstrated as useful for the determination of ancestral proportions within individuals or
study groups; and 4) software programs that can be used for the determination of ancestral
proportions within individuals or study groups. 

[0067] Previously, efforts have been made to control the two sources of population 
structure, including sampling effects and natural human demography, which were believed to 
confound efforts to identify markers of genes associated with particular traits. However, as 
disclosed herein, population structure is reflective of human demography, and markers that 
correlate with a trait value are useful as reporters of structure that correlate with trait value 
(rather than markers in LD with phenotypically active loci), and, therefore, provide a valuable 
tool that enables accurate classification in a cost-effective and practical manner. Alleles 
associated with a trait due to population structure are not linked to phenotypically active loci, 
but are merely correlated with trait value because they are enriched for in branches of the 
human family tree for which the trait value is more common. As disclosed herein, the 
distribution of trait values among the various branches of the human family tree is such that
accurate classification can be obtained only through an appreciation of that structure, rather 
than a full understanding of the biological mechanism of the trait, and, as a result, markers 
that were considered false positives when considered with respect to their use for identifying 
phenotypically active loci, in fact, can enable accurate classification analysis; i.e., they are 
true positives provided the structure from which they were derived is reflective of human 
demography rather than sampling effects. The present methods are based on correlation 
between markers and BGA, where BGA is itself on some level of complexity correlated with 
a trait value, not linkage or linkage disequilibrium. 

[0068] Accordingly, the present invention provides a method of inferring, with a
predetermined level of confidence, a trait of an individual. In one embodiment, a method of
the invention is performed by contacting a nucleic acid sample of a test individual with 
hybridizing oligonucleotides that can detect nucleotide occurrences of single nucleotide 
polymorphisms (SNPs) of a panel of at least about ten AIMs; and identifying, with a 
predetermined level of confidence, a population structure that correlates with, or is most 
likely given, the nucleotide occurrences of the AIMs in the individual, wherein the population 
structure correlates with a trait. The AIMs of the panel are selected based on their delta values (see
below) and, where relevant, based on the particular platform used to perform the method, and
are indicative of a population structure correlated with the trait. AIMs are exemplified herein
by the polynucleotides set forth as SEQ ID NOS:1 to 331, wherein the SNP position
generally is at nucleotide position 50 (but see, e.g., SEQ ID NO:35, nucleotide 56; SEQ ID 
NO:51, position 48; SEQ ID NO:56, position 26). 
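
The selection of a panel by delta value can be sketched in Python as follows. This is a minimal illustration only, assuming that delta denotes the absolute difference in the frequency of an allele between two populations; the marker names and allele frequencies are hypothetical and are not taken from SEQ ID NOS:1 to 331, and the functions `delta` and `select_panel` are illustrative names rather than part of the disclosed software.

```python
from typing import Dict, List, Tuple


def delta(freq_a: float, freq_b: float) -> float:
    """Absolute allele-frequency differential between two populations (assumed definition)."""
    return abs(freq_a - freq_b)


def select_panel(freqs: Dict[str, Tuple[float, float]], size: int) -> List[str]:
    """Return the `size` marker IDs with the largest delta, highest first."""
    ranked = sorted(freqs, key=lambda marker: delta(*freqs[marker]), reverse=True)
    return ranked[:size]


# Hypothetical allele frequencies (population A, population B); illustrative only.
example_freqs = {
    "marker_1": (0.90, 0.10),
    "marker_2": (0.55, 0.45),
    "marker_3": (0.20, 0.75),
    "marker_4": (0.60, 0.30),
}

# Keep the two markers with the largest frequency differential.
panel = select_panel(example_freqs, 2)
print(panel)
```

In practice a panel of at least about ten AIMs would be chosen in the same way, optionally after filtering for markers compatible with the genotyping platform.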

[0069] A test individual for whom a trait is to be inferred can be any individual for whom 
it is desired to infer a trait, and generally is a human. However, the methods of the invention 
also can be used for inferring traits of other mammals, including, for example, domestic 
animals such as cats, dogs, or horses; farm animals such as cattle, sheep, pigs, or goats; or 
other animals. The trait to be examined can be any trait of interest, including, as exemplified 
herein, proportional ancestry (BGA); hair, skin or iris pigmentation; or drug responsiveness. 

[0070] The methods of the invention are particularly useful because they allow for an
inference to be made of a desired trait with a predetermined level of confidence. As used
herein, reference to a "predetermined level of confidence" means that an inference or estimate 
of the invention is made using statistical methods that provide a confidence interval to be 
determined about a mean or a maximum likelihood value. In addition to determining the 
maximum likelihood value of within-individual or within-sample structure, other similarly 
likely values can also be determined and these can be combined to define the x-fold 
likelihood confidence intervals, where x is any number such as 2, 5 or 10. For example, all of 
the structure results corresponding to a likelihood value 10 times lower than the Maximum
Likelihood Value can be plotted or listed to define the 10-fold likelihood confidence interval. 
As for any statistical test, an assay of the invention is designed such that performance of the 
test results in a value having a desired confidence level. As disclosed herein, a method of the 
invention can be performed such that the result has a predetermined level of confidence by 
varying the number of AIMs examined with respect to a trait. For example, use of a certain 
panel of ten AIMs will allow an inference to be made as to whether an individual has a 
particular trait, e.g., responsiveness to Lipitor™, with a certain level of confidence, whereas 
use of a panel of twenty AIMs, which can, but need not, partially overlap with the
panel of ten AIMs, will allow the same inference to be made, but with a higher level of 
confidence. Similarly, use of two panels of ten AIMs each can allow an inference to be made
that an individual has, for example, 80% IndoEuropean ancestry and 20% East Asian
ancestry (with an error, e.g., of ± 10%), whereas the use of two panels of twenty AIMs each
can allow the same inference, but with an error, e.g., of ± 5%.
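
The x-fold likelihood confidence interval described above can be illustrated with a short Python sketch. It is not the software of the invention; it assumes a simplified two-population model in which the expected allele frequency at each marker is a mixture m*p + (1 - m)*q of the parental frequencies p and q, genotypes are coded as counts (0, 1 or 2) of the reference allele, and the likelihood is evaluated on a grid of ancestry proportions m. All frequencies and genotypes are hypothetical.

```python
import math
from typing import Sequence, Tuple


def log_likelihood(m: float, genotypes: Sequence[int],
                   freqs_a: Sequence[float], freqs_b: Sequence[float]) -> float:
    """Log-likelihood of ancestry proportion m (from population A) under a binomial model."""
    total = 0.0
    for g, p, q in zip(genotypes, freqs_a, freqs_b):
        f = m * p + (1.0 - m) * q            # expected allele frequency in the individual
        f = min(max(f, 1e-9), 1.0 - 1e-9)    # guard against log(0)
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total


def x_fold_interval(genotypes: Sequence[int], freqs_a: Sequence[float],
                    freqs_b: Sequence[float], x: float = 10.0,
                    steps: int = 100) -> Tuple[float, float, float]:
    """Return (maximum likelihood m, lower bound, upper bound) of the x-fold interval."""
    grid = [i / steps for i in range(steps + 1)]
    lls = [log_likelihood(m, genotypes, freqs_a, freqs_b) for m in grid]
    best = max(lls)
    mle = grid[lls.index(best)]
    # Keep every m whose likelihood is no more than x times lower than the maximum.
    inside = [m for m, ll in zip(grid, lls) if ll >= best - math.log(x)]
    return mle, min(inside), max(inside)


# Hypothetical panels: every marker has frequency 0.9 in population A and 0.1 in
# population B, and the individual is homozygous for the reference allele throughout.
ten = x_fold_interval([2] * 10, [0.9] * 10, [0.1] * 10)
twenty = x_fold_interval([2] * 20, [0.9] * 20, [0.1] * 20)
print(ten, twenty)
```

Under these assumptions the 10-fold interval obtained with twenty markers is narrower than that obtained with ten, which mirrors the relationship between the number of AIMs examined and the level of confidence.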

[0071] A sample useful for practicing a method of the invention can be any biological
sample of a test individual that contains nucleic acid molecules, including portions of the
gene sequences containing AIMs that are to be examined or, wherein the polymorphism of an 
AIM results in an amino acid change in an encoded polypeptide, any biological sample that 
contains the encoded polypeptides. As such, the sample can be a cell, tissue or organ sample, 
or can be a sample of a biological fluid such as semen, saliva, blood, cerebrospinal fluid, and
the like. 

[0072] A nucleic acid sample useful for practicing a method of the invention will depend, 
in part, on whether the SNPs to be identified are in coding regions or in non-coding regions. 
Where one or more SNPs is present in a non-coding region of a gene, the nucleic acid sample 
generally is a deoxyribonucleic acid (DNA) sample, particularly genomic DNA or an 
amplification product thereof. However, where the AIM is contained within a transcribed 
sequence, e.g., rDNA, microsatellite DNA, or heterogeneous nuclear ribonucleic acid (RNA), which
includes unspliced mRNA precursor RNA molecules including non-coding RNA sequence, 
an RNA sample can be used and examined directly, or a cDNA or amplification product 
thereof can be examined according to the present methods. Where one or more SNPs is 
present in a coding region of a gene, the nucleic acid sample can be DNA or RNA, or 
products derived therefrom, for example, amplification products. Furthermore, while the 
methods of the invention are exemplified with respect to a nucleic acid sample, it will be 
recognized that particular SNPs, when present in coding regions of a gene, can result in
polypeptides containing different amino acids at the positions corresponding to the SNPs due 
to non-degenerate codon changes. As such, in one aspect, the methods of the invention are 
practiced using a sample containing polypeptides of the subject. 

[0073] A method of the invention is performed by contacting the sample and hybridizing 
oligonucleotides under conditions suitable for detecting the nucleotide occurrences of the 
AIMs of the individual by the hybridizing oligonucleotides. Further, in aspects of the
methods of the invention, the sample can be contacted with second hybridizing
oligonucleotides, for example, to determine a sub-population structure. It should be 
recognized that the term "second", when used in reference to hybridizing oligonucleotides (or 
to a panel of AIMs), is used for convenience of discussion so as to allow a clear distinction, 
e.g., of steps for performing a method. In this respect, it should be further recognized that 
one or more hybridizing oligonucleotides used, e.g., to determine a population structure, also 
can be included among the second hybridizing oligonucleotides. 

[0074] Conditions suitable for detecting the nucleotide occurrences of AIMs will vary 
depending on the sequences of the hybridizing oligonucleotides, including their length and 
complementarity, as well as on the particular assay being used and, for example, whether the 
assay is being performed as a multiplex assay. The hybridizing oligonucleotides, which are 
at least 15 nucleotides in length, can contain deoxyribonucleotides or ribonucleotides, which 
are linked together by a phosphodiester bond, and can be single stranded or double stranded, 
though they generally are used in a single stranded form. Such hybridizing oligonucleotides 
can be prepared using methods of chemical synthesis or by enzymatic methods such as by the 
polymerase chain reaction (PCR). 

[0075] The hybridizing oligonucleotides, or other polynucleotides useful in a method or
contained in a kit of the invention also can contain nucleoside or nucleotide analogs, and can 
have a backbone bond other than a phosphodiester bond, such oligonucleotides providing 
certain advantages such as having increased stability or more desirable hybridization 
properties. Nucleotide analogs are well known in the art and commercially available, as are 
polynucleotides containing such nucleotide analogs (Lin et al., Nucl. Acids Res.
22:5220-5234, 1994; Jellinek et al., Biochemistry 34:11363-11372, 1995; Pagratis et al.,
Nature Biotechnol. 15:68-73, 1997, each of which is incorporated herein by reference). The 
covalent bond also can be any of numerous other bonds, including a thiodiester bond, a 
phosphorothioate bond, a peptide-like bond or any other bond known to those in the art as 
useful for linking nucleotides to produce synthetic oligonucleotides (see, for example, Tam et 
al., Nucl. Acids Res. 22:977-986, 1994; Ecker and Crooke, BioTechnology 13:351-360, 1995,
each of which is incorporated herein by reference). The incorporation of non-naturally
occurring nucleotide analogs or bonds linking the nucleotides or analogs can be particularly
useful where the oligonucleotide is to be exposed to an environment that can contain a
nucleolytic activity, including, for example, a tissue culture medium or sample comprising a 
cell extract because the modified oligonucleotides can be less susceptible to degradation. 

[0076] Generally, the hybridizing oligonucleotides useful for purposes of the present 
invention are at least about 15 bases in length, which is sufficient to permit the 
oligonucleotide to selectively hybridize to a target polynucleotide comprising the AIM, and 
can be at least about 18 nucleotides or 21 nucleotides or 25 nucleotides or more in length. 

The term "selective hybridization" or "selectively hybridize" refers to hybridization under 
moderately stringent or highly stringent physiological conditions, which can distinguish 
related nucleotide sequences from unrelated nucleotide sequences. In nucleic acid 
hybridization reactions, the conditions used to achieve a particular level of stringency are 
known to vary, depending on the nature of the nucleic acids being hybridized, including, for 
example, the length, degree of complementarity, nucleotide sequence composition (e.g., 
relative GC:AT content), and nucleic acid type, i.e., whether the oligonucleotide or the target 
nucleic acid sequence is DNA or RNA. An additional consideration is whether one of the 
nucleic acids is immobilized, for example, on a filter, bead, chip, or other solid matrix. 

[0077] Methods for selecting appropriate stringency conditions can be determined 
empirically or estimated using various formulas, and are well known in the art (see, for 
example, Sambrook et al., supra, 1989). An example of progressively higher stringency
conditions is as follows: 2X SSC/0.1% SDS at about room temperature (hybridization
conditions); 0.2X SSC/0.1% SDS at about room temperature (low stringency conditions);
0.2X SSC/0.1% SDS at about 42°C (moderate stringency conditions); and 0.1X SSC at about
68°C (high stringency conditions). Washing can be carried out using only one of these 
conditions, for example, high stringency conditions, or each of the conditions can be used, for 
example, for 10 to 15 minutes each, in the order listed above, repeating any or all of the steps 
listed. As such, final conditions will vary, depending on the particular hybridization reaction 
involved, and can be determined empirically. It should be recognized that a variety of 
conditions can be utilized to provide selective hybridization conditions. For example, when a 
multiplex assay is to be performed using a plurality of different hybridizing oligonucleotides 
specific for different AIMs of a panel, the conditions (as well as the AIMs/hybridizing
oligonucleotides) can be selected such that selective hybridization occurs for all of the
hybridizing oligonucleotides in the reaction. 
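
As one example of estimating hybridization conditions by formula, the following Python sketch applies the widely used Wallace rule of thumb, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is generally applied to short oligonucleotides. The rule is offered here as an assumption for illustration; it is not stated in the present disclosure, and the oligonucleotide sequence shown is hypothetical.

```python
def wallace_tm(oligo: str) -> int:
    """Approximate melting temperature (deg C) of a short DNA oligonucleotide."""
    seq = oligo.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("oligo must be a non-empty string of A, C, G and T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    # Each A/T pair contributes about 2 deg C and each G/C pair about 4 deg C.
    return 2 * at + 4 * gc


# Hypothetical 18-mer with 50% GC content.
probe = "ACGTACGTACGTACGTAC"
print(wallace_tm(probe))
```

For a multiplex assay, estimates of this kind could be compared across all hybridizing oligonucleotides of a panel so that oligonucleotides with similar values are grouped in one reaction; final conditions would still be determined empirically.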

[0078] In various embodiments, it can be useful to detectably label a polynucleotide or 
hybridizing oligonucleotide. Detectable labeling of a polynucleotide is well known in the art 
and includes, for example, the use of detectable labels such as chemiluminescent labels, 
radionuclides, enzymes, haptens such as digoxygenin and biotin, fluorophores, and unique 
oligonucleotide sequences. For example, PCR can be performed, wherein one
primer is biotinylated and the other primer contains digoxygenin. The amplification products 
can then be bound to a streptavidin plate, washed, reacted with an enzyme-conjugated 
antibody to digoxygenin, and developed with a chromogenic, fluorogenic, or 
chemiluminescent substrate for the enzyme. Alternatively, a radioactive method can be used 
to detect generated amplification products, for example, by including a radiolabeled 
deoxynucleoside triphosphate into the amplification reaction, then blotting the amplification 
products onto DEAE paper for detection. In addition, if one primer is biotinylated, then 
streptavidin-coated scintillation proximity assay plates can be used to measure the PCR 
products. Additional methods of detection can use a chemiluminescent label, for example, a
lanthanide chelate such as used in the DELFIA® assay (Pall Corp.), a fluorescent label, or an 
electrochemiluminescent label such as ruthenium tris-bipyridyl (ORI-GEN). 

[0079] Methods for detecting a nucleotide occurrence at a SNP or DIP position of an AIM
can utilize one or more oligonucleotide probes or primers, including, for example, an 
amplification primer pair, that selectively hybridize to a target polynucleotide spanning the 
AIM. Oligonucleotide probes useful in practicing a method of the invention can include, for 
example, an oligonucleotide that is complementary to and spans a portion of the target 
polynucleotide, including the position of the SNP (or DIP), wherein the presence of a specific
nucleotide at the position of the SNP is detected by the presence or absence of selective 
hybridization of the probe. Such a method can further include contacting the target 
polynucleotide and hybridized oligonucleotide with an endonuclease, and detecting the 
presence or absence of a cleavage product of the probe, depending on whether the nucleotide 
occurrence at the SNP site is complementary to the corresponding nucleotide of the probe. A
pair of probes that specifically hybridize upstream and adjacent and downstream and adjacent
to the site of the SNP, wherein one of the probes includes a nucleotide complementary to a
nucleotide occurrence of the SNP, also can be used in an oligonucleotide ligation assay,
wherein the presence or absence of a ligation product is indicative of the nucleotide 
occurrence at the SNP site. An oligonucleotide also can be useful as a primer, for example, 
for a primer extension reaction, wherein the product (or absence of a product) of the 
extension reaction is indicative of the nucleotide occurrence. In addition, a primer pair useful 
for amplifying a portion of the target polynucleotide including the SNP or DIP site can be 
useful, wherein the amplification product is examined to determine the nucleotide occurrence 
at the SNP site or to determine whether there is an insertion or a deletion at the DIP site. 

[0080] Numerous methods are known for determining a nucleotide occurrence at a 
particular position in a polynucleotide (i.e., of a SNP or DIP). Such methods can utilize one 
or more oligonucleotide probes or primers, including, for example, an amplification primer 
pair, that selectively hybridize to a target polynucleotide, which contains one or more SNP 
positions. Hybridizing oligonucleotides useful in practicing a method of the invention can
include, for example, an oligonucleotide that is complementary to and spans a portion of the
target polynucleotide, including the position of the SNP or DIP (including whether the DIP
has a deletion or insertion), wherein the presence of a specific nucleotide at the SNP site or
the presence of a deletion or insertion at the DIP site is detected by the presence or absence of
selective hybridization of the oligonucleotide probe. Such a method can further include 
contacting the target polynucleotide and hybridized oligonucleotide with an endonuclease, 
and detecting the presence or absence of a cleavage product of the probe, depending on 
whether the nucleotide occurrence at the SNP site is complementary to the corresponding 
nucleotide of the probe. 

[0081] An oligonucleotide ligation assay also can be used to identify a nucleotide
occurrence at a SNP site, wherein a pair of probes selectively hybridize upstream and
adjacent to and downstream and adjacent to the site of the SNP, and wherein one of the
probes includes a terminal nucleotide complementary to a nucleotide occurrence of the SNP.
Where the terminal nucleotide of the probe is complementary to the nucleotide occurrence,
selective hybridization includes the terminal nucleotide such that, in the presence of a ligase,
the upstream and downstream oligonucleotides are ligated. As such, the presence or absence
of a ligation product is indicative of the nucleotide occurrence at the SNP site. 
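
The logic of the oligonucleotide ligation assay can be sketched in silico as follows. This Python example is a simplified model under stated assumptions: sequences are written on the probe-sense strand (the strand identical in sequence to the probes, i.e., the complement of the strand to which they anneal), only perfect matches permit ligation, and the sequence, SNP index and function names (`ligation_expected`, `call_allele`) are hypothetical. Actual ligase fidelity and hybridization conditions are not modeled.

```python
from typing import List


def ligation_expected(reference: str, snp_index: int,
                      upstream_probe: str, downstream_probe: str) -> bool:
    """True if both probes match perfectly and abut at the SNP, so that ligation is expected.

    The upstream probe ends at the SNP position (its terminal nucleotide is allele
    specific); the downstream probe starts immediately after the SNP.
    """
    start = snp_index - len(upstream_probe) + 1
    end = snp_index + 1 + len(downstream_probe)
    if start < 0 or end > len(reference):
        raise ValueError("probes extend beyond the reference sequence")
    upstream_ok = reference[start:snp_index + 1] == upstream_probe
    downstream_ok = reference[snp_index + 1:end] == downstream_probe
    return upstream_ok and downstream_ok


def call_allele(reference: str, snp_index: int, flank: int,
                alleles: str = "ACGT") -> List[str]:
    """Return the alleles whose allele-specific upstream probe would give a ligation product."""
    up_flank = reference[snp_index - flank:snp_index]
    down = reference[snp_index + 1:snp_index + 1 + flank]
    return [a for a in alleles
            if ligation_expected(reference, snp_index, up_flank + a, down)]


# Hypothetical sample sequence with the SNP at index 10 (a T in this sample).
sample = "GATTACAGGCTTACGATCCA"
print(call_allele(sample, 10, flank=5))
```

A sample from a heterozygous individual would be represented by two such sequences, one per chromosome, and a ligation product would be expected with each of the two corresponding allele-specific probes.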

[0082] A hybridizing oligonucleotide also can be useful as a primer, for example, for a 
primer extension reaction, wherein the product (or absence of a product) of the extension 
reaction is indicative of the nucleotide occurrence at a SNP site or an insertion or deletion at a 
DIP site. In addition, a primer pair useful for amplifying a portion of the target 
polynucleotide including the SNP or DIP site can be useful, wherein the amplification 
product is examined to determine the nucleotide occurrence at the SNP site or the presence of 
a deletion or an insertion at the DIP site. Particularly useful methods include those that are 
readily adaptable to a high throughput format, to a multiplex format, or to both. 

[0083] Conditions that allow generation of an amplification product in a sample in which 
an amplification reaction is being performed are such that the reaction contains the necessary 
components for the amplification reaction to occur. Such conditions include, for example, 
appropriate buffer capacity and pH, salt concentration, metal ion concentration if necessary 
for the particular polymerase, appropriate temperatures that allow for selective hybridization 
of the primer or primer pair to the template target polynucleotide, as well as appropriate 
cycling of temperatures that permit polymerase activity and melting of a primer or primer 
extension or amplification product from the template or, where relevant, from forming a 
secondary structure such as a stem-loop structure. Such conditions and methods for selecting 
such conditions are routine and well known in the art (see, for example, Innis et al., "PCR 
Strategies" (Academic Press 1995); Ausubel et al., "Short Protocols in Molecular Biology"
4th Edition (John Wiley and Sons, 1999), each of which is incorporated herein by reference).

[0084] A primer extension or amplification product can be detected directly or indirectly 
and/or can be sequenced using various methods known in the art. Amplification products 
that span a SNP site can be sequenced using traditional sequence methodologies, including, 
for example, the dideoxy-mediated chain termination method (Sanger et al., J. Molec. Biol. 
94:441, 1975; Prober et al., Science 238:336-340, 1987) or the chemical degradation method
(Maxam et al., Proc. Natl. Acad. Sci. USA 74:560, 1977) to determine the nucleotide 
occurrence at the SNP loci.

[0085] The nucleotide occurrence at a SNP site also can be determined using a
microsequencing method, wherein the identity of only a single nucleotide is determined at a 
predetermined site (U.S. Pat. No. 6,294,336). Microsequencing methods include the Genetic 
Bit Analysis method (WO 92/15712). Additional, primer-guided, nucleotide incorporation 
procedures for assaying polymorphic sites in DNA have also been described (Kornher et al.,
Nucl. Acids Res. 17:7779-7784, 1989; Sokolov, Nucl. Acids Res. 18:3671, 1990; Syvanen et
al., Genomics 8:684-692, 1990; Prezant et al., Hum. Mutat. 1:159-164, 1992; Nyren et al.,
Anal. Biochem. 208:171-175, 1993). These methods differ from Genetic Bit™ Analysis in
that they all rely on the incorporation of labeled deoxyribonucleotides to discriminate 
between bases at a polymorphic site. In such a format, the signal is proportional to the 
number of deoxyribonucleotides incorporated, and polymorphisms that occur in runs of the 
same nucleotide generate signals that are proportional to the length of the run (Syvanen et al.,
Amer. J. Hum. Genet. 52:46-59, 1993). 

[0086] Another method for determining the nucleotide occurrence at a SNP position is 
described by Macevicz (U.S. Pat. No. 5,002,867), wherein a nucleic acid sequence is 
determined via hybridization with multiple mixtures of oligonucleotide probes. In 
accordance with such a method, the sequence of a target polynucleotide is determined by 
permitting the target to sequentially hybridize with sets of probes having an invariant 
nucleotide at one position, and variant nucleotides at other positions. The nucleotide
sequence is determined by hybridizing the target with a set of probes, then determining the
number of sites at which at least one member of the set is capable of hybridizing to the target
(i.e., the number of matches). This procedure is repeated until each member of a set of
probes has been tested. U.S. Pat. No. 6,294,336 provides a solid phase sequencing method 
for determining the sequence of nucleic acid molecules (either DNA or RNA) by utilizing a 
primer that selectively binds a polynucleotide target at a site wherein the SNP is the most 
3' nucleotide selectively bound to the target. 

[0087] The nucleotide occurrence of a SNP in a sample also can be determined using the 
SNP-IT™ method (Orchid BioSciences, Inc., Princeton, NJ). In general, SNP-IT™ is a
3-step primer extension reaction. In the first step, a target polynucleotide is isolated from a
sample by hybridization to a capture primer, which provides a first level of specificity. In a
second step, the capture primer is extended from a terminating nucleotide triphosphate at the
target SNP site, which provides a second level of specificity. In a third step, the extended
nucleotide triphosphate can be detected using a variety of known formats, including: direct
fluorescence, indirect fluorescence, an indirect colorimetric assay, mass spectrometry, 
fluorescence polarization, etc. Reactions can be processed in 384 well format in an 
automated format using a SNPstream™ instrument (Orchid BioSciences, Inc., Princeton, NJ). 
Phase known data can be generated by inputting phase unknown raw data from the 
SNPstream™ instrument into Stephens and Donnelly's PHASE program.

[0088] Melting curve analysis of SNPs (McSNP® analysis) provides another method for 
detecting a nucleotide occurrence in an AIM (Akey et al., supra, 2001). McSNP® analysis 
provides the additional advantages that it does not require a step of gel electrophoresis, thus 
minimizing the time and cost for detecting a SNP, and that it is readily adaptable to high 
throughput formats, thus allowing examination of one or more panels of AIMs and/or 
samples in parallel. 

[0089] Where the particular nucleotide occurrence of a SNP is such that the nucleotide 
occurrence results in an amino acid change in an encoded polypeptide, the nucleotide 
occurrence can be identified indirectly by detecting the particular amino acid in the 
polypeptide. The method for determining the amino acid will depend, for example, on the 
structure of the polypeptide or on the position of the amino acid in the polypeptide. Where 
the polypeptide contains only a single occurrence of an amino acid encoded by the particular 
SNP, the polypeptide can be examined for the presence or absence of the amino acid. For 
example, where the amino acid is at or near the amino terminus or the carboxy terminus of 
the polypeptide, simple sequencing of the terminal amino acids can be performed. 
Alternatively, the polypeptide can be treated with one or more enzymes and a peptide 
fragment containing the amino acid position of interest can be examined, for example, by 
sequencing the peptide, or by detecting a particular migration of the peptide following 
electrophoresis. Where the particular amino acid comprises an epitope of the polypeptide, 
the specific binding, or absence thereof, of an antibody specific for the epitope can be
detected. Other methods for detecting a particular amino acid in a polypeptide or peptide
fragment thereof are well known and can be selected based, for example, on convenience or 
availability of equipment such as a mass spectrometer, capillary electrophoresis system, 
magnetic resonance imaging equipment, and the like. 

[0090] In another embodiment, a method of the invention utilizes an antibody, or antigen
binding fragment thereof, that specifically binds, for example, to a polypeptide comprising an
amino acid encoded by a nucleotide sequence comprising one nucleotide occurrence of a
SNP, but not substantially to a polypeptide comprising a different amino acid encoded by
the codon comprising the SNP; or that specifically binds, for example, to a polypeptide
comprising an amino acid sequence encoded by one form of a DIP (e.g., that having the
insertion), but not substantially to that encoded by the alternative form (e.g., that having the 
deletion). As used herein, the term "specific interaction," or "specifically binds" means that 
two molecules form a complex that is relatively stable under physiologic conditions. The 
term is used herein to refer to various interactions, including, for example, the interaction of 
an antibody that binds a target polynucleotide including the SNP site only if the SNP has a 
specified, but not an alternative, nucleotide occurrence (e.g., an A, but not a T); or the
interaction of an antibody that binds a polypeptide that includes one amino acid that is 
encoded by a codon that includes a SNP site, but not a polypeptide having an alternative 
amino acid encoded by the codon comprising the SNP. 

[0091] A specific interaction can be characterized by a dissociation constant of at least 
about 1 x 10⁻⁶ M, generally at least about 1 x 10⁻⁷ M, usually at least about 1 x 10⁻⁸ M, and
particularly at least about 1 x 10⁻⁹ M or 1 x 10⁻¹⁰ M or greater. A specific interaction
generally is stable under physiological conditions, including, for example, conditions that
occur in a living individual such as a human or other vertebrate or invertebrate, as well as
conditions that occur in a cell culture such as used for maintaining mammalian cells or cells
from another vertebrate organism or an invertebrate organism. Methods for determining 
whether two molecules interact specifically are well known and include, for example, 
equilibrium dialysis, surface plasmon resonance, and the like.

[0092] Antibodies useful in a method of the invention include antibodies that specifically
bind polynucleotides that encompass an AIM, or that bind polypeptides that include an amino
acid encoded by a codon that includes a SNP or that include amino acids due to an insertion 
at a DIP site. Such antibodies are selected such that they specifically bind a polypeptide that 
includes a first amino acid encoded by a codon that includes the SNP loci, but do not bind, or 
bind measurably more weakly to a polypeptide that includes a second amino acid encoded by 
a codon that includes a different nucleotide occurrence at the SNP. 

[0093] The term "antibody" is used broadly herein to refer to immunoglobulin molecules 
and antigen binding portions of immunoglobulin molecules that specifically bind an antigen. 
As such, antibodies useful in a method of the invention can be polyclonal, monoclonal, 
multispecific, human, humanized or chimeric antibodies, single chain antibodies, Fab
fragments, F(ab') fragments, fragments produced by a Fab expression library, anti-idiotypic
(anti-id) antibodies, and the like, as well as antigen/epitope binding fragments of such
antibodies. Antigen binding fragments of antibodies include, but are not limited to, Fab, Fab'
and F(ab')2, Fd, single-chain Fv's (scFv), single-chain antibodies, disulfide-linked Fv 
fragments (sdFv) and fragments comprising either a VL or VH domain. Thus, 
antigen-binding antibody fragments, including single-chain antibodies, can comprise the 
variable region(s) alone or in combination with the entirety or a portion of the hinge region, 
CH1, CH2, and/or CH3 domains. The antibodies can be from any animal origin including
birds and mammals, or can be expressed recombinantly, for example, in insect or mammalian
host cells or in plants. 

[0094] There is much that can be learned today through the use of genetic markers in 
numerous scientific fields. The use of genetic sequences has become routine for forensics 
and disease research, but the majority of the benefits from the recently completed human 
genome project still await discovery. Within the genome exist sequences and patterns of 
sequences that will prove useful for a variety of purposes including increasing crop yields, 
extending human life spans, minimizing the suffering caused by drugs and enhancing the 
quality of our lives through better, more effective and specific treatments. Until now, 
biomedical research has been conducted on relatively simple terms. Nevertheless, more than one
thousand simple Mendelian traits have been mapped by following the transmission of genetic
markers in families. 

[0095] Many statistical methods are available for studying genetic traits, including 
traditional family-based linkage analysis, variance component methods, sib-pair linkage, 
measured genotype, transmission disequilibrium, genomic control, and structure analysis. 
Some of the genes underlying variation in susceptibility to common diseases (e.g., heart 
disease, obesity, type 2 diabetes, hypertension, and cancer) eventually will be identified using 
genetic approaches. However, there are a number of complexities in genetic research on 
common diseases because many of these conditions are multifactorial (i.e., have several 
sources of variability in risk) and polygenic (i.e., result due to the actions and interactions 
among several genes). Additional difficulties in the study of common diseases can derive 
from the late onset of symptoms and heterogeneity in etiology. Thus, identifying the genes 
involved in complex diseases remains one of the greatest challenges in the field of human 
genetics. 

[0096] There has been increased interest in association studies as a useful approach to map 
common disease and drug response genes (Risch and Merikangas, Science 273:1516-1517, 
1996; Jorde, Genome Res. 10:1435-1444, 2000; Nordborg and Tavare, Trends Genet. 
18:83-90, 2002). Until the present disclosure, however, the implication of ancestry for 
identifying these genes has not been fully appreciated. As such, the methods of the invention 
provide a previously undescribed platform for the identification of genes associated with 
disease susceptibility and drug responsiveness, as well as for the development of advanced 
forensic methods. In particular, compositions and methods are provided for inferring an 
individual's response to commonly used medications, which, remarkably, is a function of 
individual ancestry; the disclosed markers and methods are, to a differing extent for each 
drug, useful for the inference of such response. In addition, compositions and methods are 
provided for inferring individual and/or group ancestral proportions from knowledge of the 
individual's or group's DNA sequences. Further, compositions and methods are provided for 
using knowledge of ancestry relevant DNA sequences to identify disease susceptibility and 
drug response genes through the MALD process. Also, compositions and methods are 
provided for qualifying and normalizing study groups for more traditional methods of 
mapping disease genes. Each of these processes requires an accurate knowledge of ancestry, 
which can be determined using the methods and compositions disclosed herein. 

[0097] The question of which populations are best suited for linkage disequilibrium (LD) mapping 
has prompted much discussion and debate (see Wright et al., Nat. Genet. 23:397-404, 1999; 
Eaves et al., Nat. Genet. 25:320-323, 2000; Nordborg and Tavare, supra, 2002; Kaessmann et 
al., Amer. J. Hum. Genet. 70:673-685, 2002). The extent of LD is a complex function of a 
number of genetic and evolutionary factors such as mutation, recombination and gene 
conversion rates, demographic and selective events, and the age of the mutation itself. Some 
of these factors affect the whole genome, while others only affect particular genome regions. 
Additionally, variation in mutation, recombination, and gene conversion rates throughout the 
genome is expected to create LD differences between genomic regions (see, for example, 
Taillon-Miller et al., Nat. Genet. 25:324-328, 2000). 

[0098] It has been proposed that small, isolated and inbred populations will have 
advantages over other populations, due to the lower heterogeneity and the larger extent of 
linkage disequilibrium (Wright et al., supra, 1999; Nordborg and Tavare, supra, 2002; 
Kaessmann et al., supra, 2002). Other populations well suited for mapping are recently 
admixed populations (e.g., Hispanics and African Americans), which offer the advantage that 
LD has been created recently due to the admixture process. Because this LD is recent, it can 
extend over large chromosomal regions. However, it is also extremely important to control 
for the genetic structure (inter-individual variation in admixture proportions) present in these 
populations in order to avoid false positives (Parra et al., supra, 1998; Lautenberger et al., 
Amer. J. Hum. Genet. 66:969-978, 2000; Pfaff et al., Amer. J. Hum. Genet. 68:198-207, 2001; 
Nordborg and Tavare, supra, 2002, each of which is incorporated herein by reference). 
Interest in admixture mapping has increased in recent years (McKeigue et al., Ann. Hum. 
Genet. 64:171-186, 2000; Smith et al., J. Invest. Dermatol. 111:119-122, 2001; 
Collins-Schramm et al., Amer. J. Hum. Genet. 70:737-750, 2002, each of which is incorporated 
herein by reference). A general description of admixture mapping is provided below, as are 
some details about a statistical approach developed for admixture mapping and its application 
to skin pigmentation as a model phenotype. 

[0099] Admixture generates allelic associations between all marker loci where allele 
frequencies are different between the parental populations (Chakraborty and Weiss, Proc. 
Natl. Acad. Sci., USA 85:9119-9123, 1988). These associations decay with time, in a way that 
is dependent on the genetic distance between them. Thus, disease (or trait) risk alleles that 
are different between the parental populations can be mapped in admixed populations using 
special panels of genetic markers showing high frequency differences between the parental 
populations. These markers, termed AIMs, are characterized by having particular alleles that 
are more common in one group of populations than in other populations. One measure of the 
informativeness of such markers is the allele frequency differential, delta (δ), which is simply 
the absolute value of the difference in the frequency of a particular allele between populations (Chakraborty 
and Weiss, supra, 1988; Dean et al., supra, 1994). 
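
By way of non-limiting illustration only, the delta measure described above, and the allelic association that admixture creates between two such markers, can be sketched as follows. The marker names and frequencies are hypothetical and are not taken from the tables herein. The expression used for the initial association, m(1 - m) times the product of the signed frequency differences at the two loci, is the standard population-genetics result for a single pulse of admixture between parental populations that themselves show no association between the loci; it is offered as an assumption for purposes of the sketch.

```python
def delta(freq_pop1, freq_pop2):
    """Allele frequency differential between two parental populations."""
    return abs(freq_pop1 - freq_pop2)


def initial_admixture_ld(m, p1_a, p2_a, p1_b, p2_b):
    """Association (D) between loci A and B immediately after admixture.

    m is the proportional contribution of parental population 1.
    Assumes no association between A and B within either parental population.
    """
    return m * (1.0 - m) * (p1_a - p2_a) * (p1_b - p2_b)


# Hypothetical allele frequencies: (parental population 1, parental population 2)
markers = {"marker_1": (0.90, 0.10), "marker_2": (0.55, 0.45)}

# Markers with larger delta are more informative for ancestry.
deltas = {name: delta(f1, f2) for name, (f1, f2) in markers.items()}

# Association created between two high-delta markers by 20% admixture.
d0 = initial_admixture_ld(0.20, 0.90, 0.10, 0.80, 0.20)
```

In this sketch, marker_1 (delta of about 0.8) would be a far better candidate AIM than marker_2 (delta of about 0.1), and the association generated by admixture scales with the product of the frequency differences at the two loci.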

[0100] In admixed populations, allelic associations were generated recently and, therefore, 
are more easily detected for a given sample size because they extend over longer distances 
than in non-admixed populations (up to 10-20 centiMorgans (cM) or more). The statistical 
basis of this approach was first explored by Chakraborty and Weiss (supra, 1988) and 
subsequently by Stephens, Briscoe and O'Brien, who named the method "mapping by 
admixture linkage disequilibrium" (MALD; Stephens et al., Amer. J. Hum. Genet. 
55:809-824, 1994; Briscoe et al., J. Hered. 85:59-63, 1994). Further, whether one is using a 
MALD approach or a more traditional LD approach for genetic research, to eliminate 
associations of the trait with alleles at unlinked loci, it is necessary to control in the analysis 
for individual ancestry estimated from the marker data. The SNP sequences (markers; AIMs) 
and methods disclosed herein (BGA test) are a particularly efficient means by which to 
accomplish this task. An Analysis of Covariance (ANCOVA) test has been employed using 
the estimate of individual admixture as a conditioning variable to control for the effect of 
individual ancestry in two ways: 1) leaving out the locus under consideration 
(ANCOVA/IAE minus marker); and 2) using the complete individual ancestry estimate for 
the conditioning (ANCOVA/IAE). This method is described in detail herein. 
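
By way of non-limiting illustration only, the effect of conditioning on individual ancestry can be sketched with an ordinary least-squares fit. This is a simplified stand-in for the ANCOVA described above, not the disclosed implementation: genotype is coded as a numeric allele count rather than as a factor, the individual ancestry estimates are supplied directly as hypothetical numbers (in practice they would be estimated from the marker data, with or without the locus under test), and the function name genotype_effect is invented for this sketch. The trait is constructed to depend on ancestry alone, so that any apparent marker effect is spurious.

```python
import numpy as np


def genotype_effect(trait, genotype, ancestry=None):
    """Least-squares coefficient of genotype, optionally adjusted for ancestry."""
    columns = [np.ones_like(trait), genotype]
    if ancestry is not None:
        columns.append(ancestry)  # conditioning variable
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
    return coef[1]


# Hypothetical data: six admixed individuals.
ancestry = np.array([0.1, 0.2, 0.4, 0.5, 0.7, 0.9])   # individual ancestry estimates
genotype = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 1.0])   # allele count at an unlinked AIM
trait = 1.0 + 2.0 * ancestry                          # trait depends on ancestry only

unadjusted = genotype_effect(trait, genotype)            # spurious association
adjusted = genotype_effect(trait, genotype, ancestry)    # vanishes after conditioning
```

Because the marker allele count tracks ancestry, the unadjusted coefficient is clearly positive, whereas the coefficient obtained after conditioning on ancestry is zero to within rounding error.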

[0101] An alternative approach to exploiting admixture has been developed that, while 
based on earlier work, has little in common with classical LD mapping, and is more 
analogous to linkage analysis of an experimental cross (McKeigue, Amer. J. Hum. Genet. 
63:241-251, 1998, which is incorporated herein by reference; McKeigue et al., supra, 2000). 
For this reason, the term "admixture mapping" has been proposed as more appropriate than 
"mapping by admixture linkage disequilibrium". Instead of testing for allelic associations, 
according to the present methods, the underlying variation in ancestry is modeled on 
chromosomes of mixed descent to extract all the information about linkage that is generated 
by admixture. The methods and markers disclosed are necessary and sufficient to accomplish 
this process. Advanced statistical methods are utilized to apply this approach in practice, 
though the underlying principle on which it relies to detect linkage is straightforward. 
Suppose, for instance, that a locus accounts for some of the variation in pigmentation 
between West Africans and Europeans. If individuals of mixed descent are classified 
according to whether they have 0, 1 or 2 alleles of African ancestry at this locus, then in a 
comparison of these three groups with other factors held constant, the mean pigmentation 
level will vary with the proportion of alleles at the locus that are of African ancestry. 
Controlling the analysis for parental admixture eliminates association of the trait with 
ancestry at unlinked loci and ensures that the comparison is made with other factors held 
constant. 

[0102] To infer the ancestry of the alleles at the locus from the marker genotype, the 
conditional probability of each allelic state is required given the ancestry of the allele 
(ancestry specific allele frequencies), e.g., West African or European. There is growing 
evidence that admixture mapping will be an effective means of gene identification, and it has 
been reported that, in admixed populations, strong allelic association is observed between 
linked markers spaced at substantial distances (Parra et al., supra, 1998; Parra et al., Amer. J. 
Phys. Anthropol. 114:18-29, 2001; McKeigue et al., supra, 2000; Lautenberger et al., supra, 
2000; Smith et al., supra, 2001; Wilson and Goldstein, Amer. J. Hum. Genet. 67:926-935, 
2000; Pfaff et al., supra, 2001). Given the very high levels of association observed over long 
genetic distances, it is expected that phenotypes different between parental populations 
because of some genetic factor will also show associations with linked AIMs. A phenotype 
that is well suited to admixture mapping is skin pigmentation. 
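
By way of non-limiting illustration only, the inference of allelic ancestry from ancestry specific allele frequencies can be sketched as an application of Bayes' rule to a single observed allele copy. The frequencies and prior proportions below are hypothetical, the function name ancestry_posterior is invented for this sketch, and the prior is taken to be the individual's admixture proportions, which is an assumption rather than a requirement of the disclosed methods.

```python
def ancestry_posterior(allele_freq_by_ancestry, prior):
    """Posterior probability of each ancestry for one observed allele copy.

    allele_freq_by_ancestry: P(observed allele | ancestry) for each ancestry
    prior: prior probability of each ancestry (e.g., admixture proportions)
    """
    joint = {pop: prior[pop] * allele_freq_by_ancestry[pop] for pop in prior}
    total = sum(joint.values())
    return {pop: value / total for pop, value in joint.items()}


# Hypothetical AIM: the observed allele is common in one parental population only.
freq = {"West African": 0.90, "European": 0.10}
# Hypothetical prior: individual with 20% West African admixture.
prior = {"West African": 0.20, "European": 0.80}

posterior = ancestry_posterior(freq, prior)
```

With these hypothetical values, observing the allele raises the probability that the copy is of West African ancestry from the prior of 0.20 to roughly 0.69, which shows how a single high-delta marker shifts the inference.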

[0103] Notwithstanding the power of AIMs for disease gene and forensics analysis, no 
studies have been conducted to elucidate this power. As disclosed herein, 1) SNPs or 
deletion/insertion polymorphisms (collectively referred to as AIMs) in the human genome 
that are of potential use for drug response, disease gene or forensics research were identified; 
2) biochemical and genetic test results are provided that demonstrate these AIMs can be 
useful for disease gene and forensics research; 3) the usefulness of AIMs derived from 
systematic screens of the human genome in actual drug response, disease gene or forensics 
research is demonstrated; 4) the usefulness of AIMs derived from systematic screens of the 
human genome to make an inference as to whether an individual is susceptible to acquire a 
disease, or to not respond to a drug, is demonstrated; 5) the usefulness of AIMs derived from 
systematic screens of the human genome to make an inference as to whether a crime scene 
DNA specimen was derived from an individual of, for example, an 80% European, 
10% African and 10% Asian heritage or some other ratio/mix is demonstrated; 6) the 
usefulness of AIMs derived from systematic screens of the human genome to infer the 
ancestral proportions of an individual from their DNA (e.g., whether the individual is of 
80% European, 10% African and 10% Asian heritage, or some other ratio/mix) is 
demonstrated; and 7) the usefulness of AIMs derived from systematic screens of the human 
genome to infer the ancestral proportions of a group of individuals from their DNA (for 
example, whether the group, which can be a population sample, a family, or a clinically 
defined group of persons, is of 80% European, 10% African and 10% Asian heritage, or some 
other ratio/mix) is demonstrated. 

[0104] The present results demonstrate that AIMs are useful for the applications described 
above, and the sequences exemplified herein, as well as additional AIMs identified using the 
methods disclosed herein, enable these applications. The AIMs and methods of the invention 
are useful for the study of human diseases, drug response, and physical traits and, therefore, 
provide exceptional commercial potential. For example, in this dawning era of personalized 
drug prescription and disease risk assessment, the markers and methods of the invention 
provide the tools needed to proceed in this fledgling industry. As exemplified herein, an 
individual's response to a particular medication was dependent on the degree to which that 
individual exhibited a certain population structure (i.e., was of certain ancestral heritage) in 
addition to, but irrespective of, the person's genotype for drug target or xenobiotic 
metabolism gene sequences. As such, the compositions and methods of the invention provide 
a means to predict an individual's likelihood to respond to a particular drug. 

[0105] For example, in a screen of genetic markers associated with patient response to the 
cholesterol lowering drug, Lipitor™, in terms of low-density lipoprotein (LDL) response, 
which is an indicator of favorable response, some of the most powerful markers identified for 
LDL response to Lipitor™ were gene types that are not immediately recognized as relevant 
for drug response, including, for example, TYR, OCA2, TYRP, FDPS, and HMGCR (see, 
also, Intl. Publ. No. WO 03/002721 (PCT/US02/20847), and Intl. Publ. No. WO 03/045227 
(PCT/US02/38345), each of which is incorporated herein by reference). When combined 
with markers from genes that are biologically relevant for response, they augment the ability 
to make accurate inferences of response from the DNA. Each of these markers also is an 
excellent AIM, indicating that the linkage of the AIMs to drug response is likely a function of 
ancestral differences in response proclivity (see Example 5). As such, ancestral heritage can 
be predictive of favorable response to Lipitor™. This association has been observed for 
almost every type of response (n = 54) to almost every type of drug (n = 23) examined, thus 
confirming that the inference of drug response can be accomplished, at least in part, through 
the inference of ancestral proportions. As such, it appears that the genes truly relevant for 
drug response are a function, at least in part, of individual ancestry, and that the gene 
sequences relevant for drug response are statistically linked with markers that are informative 
as to ancestry (i.e., AIMs). 

[0106] Screening genomes for the true identity of genes associated with a particular trait 
such as drug responsiveness is extraordinarily expensive and time consuming. As such, the 
use of AIMs for making inferences about individual proclivity to drugs provides a significant 
short-cut for the rapid development of tests that can be used to match patients with those 
drugs most appropriate for their genetic constitution. Thus, in addition to being useful for the 
admixture mapping of disease genes, the disclosed methods and exemplified markers provide 
tools that can direct treatment protocols by clinicians. The identification of AIMs from 
publicly available human genome data, and the ability to effectively use the AIMs for the 
development of patient-drug classification sets, admixture screening panels and forensics 
tools, was accomplished using the disclosed method, including screening the SNP database 
(see, for example, world wide web ("www") at URL "ncbi.nlm.nih.gov") for AIMs; screening 
the AIMs against a multi-ancestral panel of DNA samples to verify those that, indeed, are 
good AIMs; using the disclosed statistical and software methods for using the AIM sequences 
to make biologically relevant inferences; and recognizing that an individual's likelihood to 
respond to a drug or develop a disease can be predicted through a knowledge of their 
ancestry, which, in turn, is indicated through the individual's AIM sequences. 

[0107] Prior to the present disclosure, individual ancestry could be estimated using two 
independent methods: a Maximum Likelihood approach (Hanis et al., Amer. J. Phys. 
Anthropol. 70(4):433-441, 1986, which is incorporated herein by reference), and a Bayesian 
method implemented in the STRUCTURE program (Pritchard et al., Genetics 155:945-959, 
2000, which is incorporated herein by reference). While the Maximum Likelihood method 
and the Bayesian method provide point estimates of proportional ancestry or admixture, there 
are several deficits in these methods that are addressed by the disclosed methods. For 
example, using the disclosed algorithm (see Example 6; see, also, Table 12, containing flow 
chart exemplifying algorithm), 1) the most likely group(s) from which the individual is 
derived were estimated simultaneously with the estimate of proportional ancestry; 
2) multidimensional confidence intervals were computed and projected, thus reducing the 
complexity for presentation; 3) an approach to estimate the number of ancestors and their 
admixture proportions at each level in the past (parental, grandparental, great-grandparental, 
etc.) was developed; and 4) proportional BGA affiliation within individuals for more than 
two BGA groups at a time was derived, thus providing, for example, improved and more 
accurate forensics applications, as well as allowing for the development of classifiers for 
quantitative or continuously distributed traits (i.e., not dichotomous), the trait values of which 
are at least in part a function of BGA. 

[0108] Independent methods for classification of individuals into population groups have 
been developed (Shriver et al., supra, 1997; Frudakis et al., supra, 2002, each of which is 
incorporated herein by reference). The present methods differ from previous classification 
methods in that they allow the simultaneous estimation of the best group within which a 
particular individual would fall, as well as the proportional assignment of the individual to 
multiple parental groups (Example 6; see, also, Table 12). Thus, where previous methods 
allowed one to make the statement that a person is much more likely to be African-American 
than European-American, the present approach allows the same statement, and also provides 
the proportional ancestry of the individual with confidence intervals (CI); e.g., 25% (95% CI 
15-35%) European ancestry; 75% (95% CI 60-80%) African ancestry; and 0% (95% CI 
0-6%) Native American ancestry. Further, the confidence intervals can be expressed in 
multidimensional space to provide a clearer representation of the ancestry measured for the 
person in question (see below; see, also, Figure 2). Though methods for constructing such a 
representation were known, the present disclosure is the first to provide for the 
representations to be presented with quantifiable confidence. 
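
By way of non-limiting illustration only, a point estimate of proportional ancestry with an accompanying interval can be sketched for the simplest case of two parental groups. This is a textbook maximum likelihood grid search, not the algorithm of Example 6: it assumes independent markers, Hardy-Weinberg proportions within the individual, parental allele frequencies strictly between 0 and 1, and an approximate 95% interval formed from all grid values whose log-likelihood lies within 1.92 units of the maximum. The function name ml_admixture and all data are hypothetical.

```python
import numpy as np


def ml_admixture(genotypes, freq_pop1, freq_pop2, grid_size=1001):
    """Grid-search ML estimate of the ancestry proportion from population 1.

    genotypes: allele counts (0, 1 or 2) at each marker
    freq_pop1, freq_pop2: parental frequencies of the counted allele
    Returns (estimate, interval_low, interval_high).
    """
    g = np.asarray(genotypes, dtype=float)
    p1 = np.asarray(freq_pop1, dtype=float)
    p2 = np.asarray(freq_pop2, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)
    # Expected allele frequency at each marker for each candidate proportion.
    q = np.outer(grid, p1) + np.outer(1.0 - grid, p2)
    # Binomial log-likelihood, omitting the constant term.
    loglik = (g * np.log(q) + (2.0 - g) * np.log(1.0 - q)).sum(axis=1)
    best = loglik.argmax()
    supported = grid[loglik >= loglik[best] - 1.92]
    return grid[best], supported.min(), supported.max()


# Hypothetical individual typed at four high-delta AIMs.
estimate, low, high = ml_admixture(
    genotypes=[2, 1, 1, 0],
    freq_pop1=[0.9, 0.9, 0.1, 0.1],
    freq_pop2=[0.1, 0.1, 0.9, 0.9],
)
```

With only four markers the interval is wide, which illustrates why a panel of many AIMs is needed before a proportional ancestry statement can be made with useful confidence.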

[0109] There are clear differences in the patterns of chromosomal segment ancestry 
(PCSA) among persons with different ancestral histories (see Figure 1). A series of AIMs 
across the chromosomes can facilitate the estimation of the most likely parental combinations 
that lead to the profile of sequences observed in a given person. One example of where 
estimates of PCSA are important is in the discrimination of persons of Hispanic ancestry from 
those having primarily European ancestry with some proportion of recent Native American 
ancestry. Indeed, this is an important determination as the political and legal rights claimed 
by and provided to these two groups can depend on their ancestry. Hispanic populations such 
as Mexican-Americans (MA) have approximately 30-40% Native American ancestry, while 
the balance is European with a minor portion (5% or so) African ancestry. A person who is 
one quarter Native American will have 25% Native American ancestry and, therefore, will 
overlap with many MA persons in his level of estimated ancestry. It is expected that PCSA 
patterns will be significantly different for these two cases and may provide some of the only 
genetic evidence that would facilitate an accurate definition of the ancestry in such a case. 
As disclosed herein, PCSA can be used in ancestry studies. 

[0110] An important step in these determinations is the phasing of the AIMs along 
chromosomal segments (see Example 2, Figure 8; see, also, Example 5, Figures 12 to 16). 
Phasing AIMs along the chromosome can be accomplished by several methods, including 
1) estimation from the genotypes of the individual, 2) molecular haplotyping (e.g., allele 
specific PCR combined with genotyping), and 3) single sperm analysis (for female subjects 
the sperm of a male full sibling would provide the same profile). In addition, the disclosed 
methods allow simultaneous consideration of the two sex chromosomes (X and Y) and the 
mtDNA for ancestral inferences. AIMs are found on each of these sources, and can be 
informative for many of the questions regarding the ancestral proportions of a person and the 
population(s) from which a particular person is derived. For example, Hispanic/Latino 
populations have very high (65-100%) frequencies of Native American mtDNA haplogroups, 
while showing only a minority contribution from Native American populations in autosomal 
markers. Thus, for example, a person with reputed Native American ancestry on her father's 
side who has a non-Native American mtDNA haplogroup is less likely to be Hispanic, and more 
likely to be partially Native American as she may suspect, than she would be were she to have 
a Native American mtDNA haplogroup. 

[0111] Linkage disequilibrium (LD) is increasingly being used as a mapping tool for both 
fine-scale determination of gene position and for the initial localization of disease genes in 
special populations. Allelic associations are significantly non-random and correlated with 
physical distance within small (< 60 kb) genomic regions (see Jorde, Amer. J. Hum. Genet. 
66:979-988, 1995; Jorde, Genome Res. 10:1435-1444, 2000, for review), possibly reflecting 
an underlying "block structure" that characterizes many genomic regions (Reich et al., 2001; 
Daly et al., 2001). Thus, if disease alleles in a population share a recent common origin, 
nearby genetic markers with the strongest associations will be closest to the disease-causing 
locus. This approach has been important in the positional cloning of several simple 
Mendelian diseases, including the cystic fibrosis gene, the Huntington's disease gene, and the 
diastrophic dysplasia gene. 

[0112] In addition to applications in fine-mapping or positional cloning, LD can be used 
for initial disease gene mapping in homogeneous populations that have undergone recent 
increases in size or are genetically inbred. In such populations, disease alleles were probably 
present in a small number of founders, and recombination has had limited opportunity to 
randomize associations between these alleles and alleles at linked marker loci. An analysis of 
allelic associations between affected and unaffected individuals from these populations can 
thus facilitate the localization of the disease locus. A number of Mendelian diseases have 
been mapped using this approach: several diseases in the Finnish population, Hirschsprung's 
disease in Mennonites, benign recurrent intrahepatic cholestasis in an isolated Dutch fishing 
community, familial persistent hyperinsulinemic hypoglycemia of infancy in a 
consanguineous group of Saudi Arabian families, and Bardet-Biedl syndrome in Bedouins. 

[0113] There has been much debate as to which populations will be best suited for LD 
mapping of complex polygenic diseases (see, e.g., Wright et al., supra, 1999; Eaves et al., 
supra, 2000; Nordborg and Tavare, supra, 2002; Kaessmann et al., supra, 2002). The extent 
of LD is a complex function of a number of genetic and evolutionary factors such as 
mutation, recombination and gene conversion rates, demographic and selective events, and 
the age of the mutation itself. Some of these factors affect the whole genome while others 
only affect particular genome regions. Additionally, variation of mutation, recombination 
and gene conversion rates throughout the genome is expected to create LD differences 
between genome regions. 

[0114] With regard to the populations to use in disease discovery efforts driven by 
LD-based methods, it has been proposed that small, isolated and inbred populations will have 
advantages over other populations, due to the lower heterogeneity and the larger extent of 
linkage disequilibrium (see, e.g., Wright et al., supra, 1999; Nordborg and Tavare, supra, 
2002; Kaessmann et al., supra, 2002). On the other hand, admixed populations such as 
Hispanics and African Americans offer the advantage that linkage disequilibrium has been 
created recently due to the admixture process, and it can extend over large chromosomal 
regions, although it is extremely important to control for the genetic structure present in these 
populations in order to avoid false positives (Parra et al., supra, 1998; Lautenberger et al., 
supra, 2000; Pfaff et al., supra, 2001; Nordborg and Tavare, supra, 2002). Despite the 
increased research focus in LD-based methods, however, many issues regarding LD in human 
populations remain largely unexplored. Currently the NHGRI is organizing a systematic 
project to help develop informational tools for gene identification studies by identifying the 
common haplotypes in several populations. This "Haplotype Map Project" (HMP) will likely 
be a large-scale multicenter effort focused on finding common haplotypes in general 
population samples. The HMP will likely prove to be an important data resource for 
identifying AIMs as disclosed herein because several populations will be investigated for 
haplotype block structure, thus providing additional candidate AIMs and a basic plan for the 
fine scale LD structure in some of the parental populations. 

[0115] Admixture mapping as disclosed herein is complementary to, but distinct from, the 
HMP. First, the primary focus of the HMP is to understand the fine scale structure of 
individual genomic regions throughout the genome, whereas the present methods allow an 
understanding of the LD that results specifically from admixture. The level of LD from 
admixture is on the order of millions of bases (Mb; megabases) and tens of Mb, while the 
HMP is focused on the level of 10's to 100's of kilobases (kb), and genomic and population 
features that affect the results from one project may not be noted in the other. Second, 
admixture mapping requires accurate parental allele frequency estimates. As such, a large 
number of different African, Native American, European, and Asian populations have been 
typed (see Table 6, below), while the HMP will likely focus on one or two samples of the 
major population groups. 

[0116] Third, large samples (n = 500) of African-Americans and Hispanics have been 
typed, thus providing sufficient statistical power to test the coverage of the admixture map 
and to compare analytical methods. In addition, several representative populations from 
different regions of the country were typed so that geographical variation in ancestral 
proportions and admixture dynamics can be examined. Although some admixed populations 
will likely be included in the HMP, the numbers of individuals and numbers of different 
population samples being discussed are fewer than those disclosed herein and, therefore, 
will not allow the same comparisons. For example, having a sample of 10 for each of 
4 ancestral groups is not adequate for the identification of sequences present preferentially in 
one or some of those groups; as disclosed herein, at least 50 individuals were tested for each 
of several tens of ancestral groups (not just four) in order to comprehensively identify these 
markers. 





[0117] Fourth, the focus of current population variation efforts (e.g., the SNP Consortium 
allele frequency project) and, very likely, the HMP has been on East Asian, African, and 
European samples to the exclusion of Native American populations for a number of complex 
reasons. The exclusion of these populations, however, results in a deficit in an understanding 
of the genetics of the fastest growing group of US resident populations, i.e., Hispanics, who 
have a significant level of Native American ancestry (20% to 40%). With the markers and 
methods disclosed herein, the disease genetics of Hispanic populations can be examined. 
Similarly, several diverse Native American populations may represent important parental 
populations for the numerous distinct groups often grouped together as Hispanic. 

[0118] The population-based association methods disclosed herein provide several 
advantages over traditional linkage studies. Localizing disease genes by traditional genetic 
linkage methods relies on the use of related persons, either extended multigenerational 
families or pairs of related individuals. These approaches are effective and very powerful 
when investigating diseases caused by single genes. However, polygenic and multifactorial 
diseases like Type 2 diabetes, hypertension, and prostate cancer result from the interaction of 
several genes and multiple environmental influences, and are more difficult to study using 
traditional methods. The identification of genes contributing to susceptibility to common 
disease is complicated by heterogeneity. The source of the genetic heterogeneity determines 
which mapping methods are most likely to work for gene identification. Two primary types 
of genetic heterogeneity are locus heterogeneity, wherein more than one locus is affecting a 
genetic trait, and allelic heterogeneity, wherein within a particular causative locus there are 
multiple alleles that are important in altering the phenotype. Traditional linkage analysis 
using extended families is generally insensitive to allelic heterogeneity, but can be adversely 
affected by locus heterogeneity. Alternatively, LD based methods are generally adversely 
affected by allele heterogeneity, but less affected by locus heterogeneity. 

[0119] Provided there is little allelic heterogeneity, association-based approaches like 
measured genotype and transmission disequilibrium test (TDT) may be more sensitive than 
family-based LOD score or sib-pair methods. Risch and Merikangas (supra, 1996) compared 
the number of individuals needed for sib-pair studies and TDT studies, and showed that the 
number of individuals needed to detect linkage is much smaller for TDT than for sib-pair 
studies. This is especially true when the disease locus has a small effect. For example, for a 
locus with risk ratio of 2.0 and a gene frequency of 50%, 2500 sib-pairs or 340 case/parents 
for TDT would be required. There are some examples in which the demonstration of 
association using Haplotype Relative Risk (HRR) or case/control design, or linkage with 
TDT, has preceded the demonstration of linkage with sib-pairs. A classic example is the 
association between the insulin gene and IDDM, which was demonstrated in cases and 
controls, then confirmed using TDT, but often not observed in sib-pair linkage studies 
(reviewed in Spielman et al., Amer. J. Hum. Genet. 28:317-331, 1993). Yaouanq et al. 
(Science, 1997) reported very significant (p < 10⁻⁹) evidence for linkage between the HLA 
and multiple sclerosis using TDT in a series of 157 French families (99 simplex and 
58 multiplex). When the 58 multiplex families were analyzed alone, p values of 0.0001 
and 0.03 resulted for TDT and sib-pair methods, respectively. 

[0120] Although association studies based on candidate genes have relatively higher 
power to detect disease genes than linkage analysis in families (Risch and Merikangas, supra, 
1996), thoroughly testing all the genes in a genome with over 40,000 genes is currently not 
practical. The Haplotype Map Project may succeed in creating the informational resources 
necessary to perform gene identification based on linkage disequilibrium. However, even if 
the block structure models of the human genome can be explained by four haplotypes in each 
gene, the minimum number of SNPs and DIPs would then be 80,000 and the actual number 
likely higher. Although genotyping technologies are advancing rapidly, typing this number 
of markers in a large number of research subjects is not yet practical. Additionally, there are 
some important assumptions that are implicit in plans to identify genes using LD in large 
populations. One important difficulty using linkage disequilibrium in genome-wide 
screening is that LD decays exponentially with the recombination fraction between the 
marker and the disease locus and with the age of the disease-causing mutations. For older 
mutations that predispose to diseases, LD becomes very weak even between the disease allele 
and alleles at relatively closely spaced marker loci. 
[0121] LD mapping has been useful in mapping of rare genetic diseases such as cystic 
fibrosis and diseases in special populations like the Finns and Bedouins, populations that 
have been subject to significant population bottlenecks, inbreeding, or founder effects. In 
these situations, LD exists because the variant allele is relatively young, as in the case of 
cystic fibrosis, or the population has reduced genetic variability, which elevates the LD 
throughout the genome. A leading model for the genetics of common disease stipulates 
predisposing alleles at a number of loci which, when present in particular combinations, 
increase an individual's risk (Greenberg, Amer. J. Hum. Genet. 52:135-143, 1993; Lander and 
Schork, Science 265:2037-2048, 1994; Risch and Merikangas, supra , 1996). If the disease is 
common, then for this model, the predisposing alleles also are expected to be at relatively 
high frequencies. However, assuming the neutral model, the frequency of an allele in a 
population is on average related to the age of the allele such that more frequent alleles are 
older than rare alleles. This fact poses a problem for the application of LD-based methods to 
identify common disease genes in populations that are not isolated or inbred since, in 
homogeneous populations, the LD is inversely related to the age of the allele and risk alleles 
for common disease are expected on average to be relatively old. 

[0122] The application of the compositions and methods of the present invention for 
admixture mapping allows for a more precise and reliable mapping of complex traits. 
Admixture mapping takes advantage of the LD created when previously isolated populations 
admix, and can circumvent these problems in mapping complex traits. It was first recognized 
that admixed populations could be useful in determining genetic linkage by Chakraborty and 
Weiss (supra, 1988). When genetically divergent populations hybridize, non-random allelic 
associations result among loci that have significant allele frequency differentials, even among 
unlinked loci. This LD quickly decays when the genetic loci in question are not located close 
together on the same chromosome. 

[0123] LD decays as a function of the recombination rate (θ) between the two markers and 
the number of generations (n) since their hybridization, and can be represented as 
D_n = (1 - θ)^n D_0, where D_n is the linkage disequilibrium n generations after hybridization and D_0 is the 
initial linkage disequilibrium (Chakraborty and Weiss, supra, 1988). Given this exponential 
relationship between the decrease in LD and genetic distance, it is possible, in an admixed 
population (if the time since admixture is short), to discriminate between LD that remains high 
because the markers are close together (genetically linked) and background linkage 
disequilibrium among unlinked loci. For example, after 10 generations, the linkage 
disequilibrium at unlinked loci is reduced to 0.1% of the initial level, while at loci 10 cM and 
1 cM apart, the disequilibrium due to true linkage will still be 34.9% and 90.4%, respectively, 
of the initial level. The critical parameters identified for effective detection of linkage in an 
admixed population are the frequency differential (δ) between the parental populations and 
the number of generations since hybridization. Linkage by association analysis in admixed 
populations worked efficiently if δ was large (not less than 0.2) and the number of 
generations since admixture small (on the order of 10 generations; Chakraborty and Weiss, 
supra, 1988). 
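
The percentages quoted above can be checked directly from the decay relationship D_n = (1 - θ)^n D_0. The following is a minimal sketch, not part of the original disclosure; the function name ld_remaining is hypothetical, and it assumes that unlinked loci recombine with θ = 0.5 and that 10 cM and 1 cM correspond approximately to θ = 0.10 and θ = 0.01.

```python
def ld_remaining(theta: float, generations: int) -> float:
    """Return D_n / D_0 = (1 - theta)^n, the fraction of initial LD remaining."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - theta) ** generations


# Assumed mapping of genetic distance to recombination fraction
# (1 cM ~ theta of 0.01; a reasonable approximation only for short distances).
cases = {"unlinked": 0.50, "10 cM": 0.10, "1 cM": 0.01}

for label, theta in cases.items():
    pct = 100.0 * ld_remaining(theta, generations=10)
    print(f"{label}: {pct:.1f}% of initial LD after 10 generations")
# prints 0.1%, 34.9% and 90.4% for the three cases
```

The gap between the unlinked and linked cases narrows as n grows, which is consistent with the statement that the approach works best when the number of generations since admixture is small.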

[0124] Stephens et al. (supra, 1994) and Briscoe et al. (supra, 1994) studied this approach 
using computer simulations (MALD) and detailed practical considerations for study design. 
Using simple models of the type of admixture that has occurred in the Americas, they 
suggested that using sample sizes of 200-300 patients, typed for 200-300 evenly spaced 
markers, each having δ > 0.3, one would have > 95% chance of locating the causative gene. 
A consistent result from the several models studied was a primary dependence of the power 
of MALD on the allele frequency differentials of the markers used. If δ is small, the initial 
LD will be small and difficult to distinguish from the background noise. 

[0125] Stephens et al. (supra, 1994) suggested using loci where δ > 0.4 between the 
admixed parental populations for effective admixture mapping. They also demonstrated that 
admixture mapping is most effective in populations that hybridized between 4 and 
20 generations ago, and that incremental admixture (the slow introgression of one population 
into another; also known as the continuous gene flow model) affects the power of admixture 
mapping, but not critically, provided there has been no new introgression from the parental 
populations in the past three generations. The disclosed admixture mapping technique can 
identify the location of disease susceptibility genes by the analysis of admixed populations 
composed of parental populations where there is a large difference in the frequency of the 
susceptible genotype. As such, applications of admixture mapping include the study of 
Type 2 diabetes susceptibility in Pacific Island populations, hypertension, obesity, and 
prostate cancer in African Americans, and Type 2 diabetes, obesity and gallbladder disease in 
Hispanic populations. 

[0126] McKeigue developed an approach to exploit admixture in mapping genes that 
builds on earlier work (McKeigue, supra, 1997; McKeigue, supra, 1998, McKeigue et al., 
supra, 2000). Although the approach is powered by the LD that is generated by admixture, it 
is more analogous to linkage analysis of an experimental cross. For this reason, the term 
"admixture mapping" was proposed. Instead of testing for allelic associations, one can model 
the underlying variation in ancestry on chromosomes of mixed descent to extract all the 
information about linkage that is generated by admixture. 

[0127] As discussed above, advanced statistical methods are required to apply this 
approach in practice. Conditioning on parental admixture eliminates association of the trait 
with ancestry at unlinked loci and ensures that the comparison is made with other factors held 
constant. In non-statistical language, a comparison is made in each individual of the 
proportion of alleles at the marker locus that are of a particular descent with the expected 
proportion given the admixture of that individual's parents. One simple way to do this is to 
use the Analysis of Covariance (ANCOVA) test (see Tables 2 and 3), though this simpler 
approach does not use all of the information available. As such, Bayesian methods also have 
been used (see Tables 2 and 3). 

[0128] To infer the ancestry of the alleles at the locus from the marker genotype, the 
ancestry-specific allele frequencies are required; i.e., the conditional probability of each 
allelic state given the ancestry of the allele (West African or European, in this example). The 
total population of alleles at any locus in the admixed population can be considered to be 
made up of two subpopulations - alleles of African ancestry and alleles of European ancestry. 
As long as the ancestry-specific allele frequencies are correctly specified for the admixed 
population under study, Bayes' theorem can be applied to invert these conditional 
probabilities and calculate the posterior distribution of ancestry at the locus (0, 1 or 2 alleles 
of African ancestry) for each individual under study. If the information conveyed by typing a 
single marker is not sufficient to assign the ancestry of each allele at the marker locus to one 
of the two founding populations, markers can be combined in a multipoint analysis to 
estimate ancestry at adjacent loci. 

[0129] Simulation studies showed that, with enough markers, a high proportion of 
information about ancestry at each locus can be extracted even though no single marker is 
fully informative for ancestry (McKeigue, supra, 1998). Based on these simulations, panels 
of markers with Fst >0.4 at an average spacing of 2-3 cM for a total of 1,000 AIMs can be 
constructed as disclosed herein. It should be recognized that the panel of 1,000 AIMs for a 
particular population (e.g., an African-American group that is primarily West African and 
European) will often overlap with panels for other groups. In other words, it is often the case 
that AIMs selected for one level of distinction (e.g., African/European) are also informative 
for other distinctions (e.g., Native American/European). Table 1 lists an initially identified 
panel of 32 AIMs (SEQ ID NOS:332 to 363; see, also, Example 1). Using a 
cutoff of δ > 0.3, only four of these markers are restricted in informativeness to one of the 
three comparisons (African/European; African/Native American; Native 
American/European); the rest are informative for two of the comparisons, and one marker is 
informative for all three comparisons. In a further study, a panel of 71 AIMs was identified 
(SEQ ID NOS:1 to 71; Table 6) that are informative as to IndoEuropean, sub-Saharan 
African, Native American, and East Indian (see Example 2). 

[0130] There is growing evidence that admixture mapping will be an effective means of 
gene identification. At least three independent groups have reported strong admixture 
linkage disequilibrium (ALD) between linked markers spaced at substantial distances (see, 
e.g., Parra et al., supra, 1998 and 2001; Pfaff et al., supra, 2001; McKeigue et al., supra, 
2000). Given the very high levels of association that have been observed over long genetic 
distances, it is expected that phenotypes dramatically different between parental populations 
because of some genetic difference will also show associations with linked AIMs. However, 
as promising as the MALD approach appears, until the present disclosure, no systematic 
screen has been reported identifying SNP-based versions of the AIMs required. McKeigue 
and others have identified panels of STR AIMs for use with this approach, but the use of 
STRs for this purpose is problematic because of the allelic complexity of STRs and the 
massive databases required in order to accurately estimate allele frequencies. Even small 
errors or faulty assumptions on the frequencies of unobserved alleles can amplify to cripple 
the statistical power of a study. 

[0131] Heterogeneity within the parental populations can have a confounding effect on 
admixture mapping studies. In the case of African-American populations, the process of 
admixture that took place in the New World involved a heterogeneous group of populations 
mainly from West-Central Africa and Europe, as well as some Native American populations. 
Regarding the European genetic contribution, the most important source populations came 
from Great Britain, Ireland, Germany and Italy. In spite of the diverse geographical areas of 
origin of the parental European populations, it is important to indicate the relative 
homogeneity of European populations from the genetic point of view (see for example 
Cavalli-Sforza et al., supra, 1994). 

[0132] With respect to the African contribution, it is well known that the African 

continent contains a tremendous amount of genetic diversity. However, only a subset of the 
African genetic diversity contributed to the formation of African-American populations. The 
majority of enslaved Africans came from West-Central Africa, approximately from Senegal 
in the North, to Angola in the South (Curtin, In The Atlantic Slave Trade, Madison, 
University of Wisconsin Press 1969); other areas of Africa were not affected by the slave 
trade. Of the four main linguistic families present in Africa, Niger-Congo Kordofanian, 
Nilo-Saharan, Afro-Asiatic and Khoisan (Greenberg, supra, 1963), the majority of enslaved 
Africans forcibly brought to the New World were members of the Niger-Congo family. 
This widespread family encompasses West African languages (spoken by peoples from 
Senegal to Nigeria) and Bantu languages (dominant in Central and Southern Africa). The 
Bantu languages were dispersed throughout Africa by a "recent" expansion that took place 
about 3,000 years ago, and probably originated in West Africa (Nigeria and Cameroon; see 
Excoffier et al., Yearbook Phys. Anthropol. 30:151-194, 1987 and Cavalli-Sforza et al., 
supra, 1994). This recent origin is reflected in the linguistic and genetic homogeneity of the 
Bantu (Excoffier et al., supra, 1987, Weber et al., supra, 2000). Thus, the available 
historical, linguistic and genetic evidence indicates that only a subset of the diversity found in 
sub-Saharan Africa has contributed to the African-American gene pool, and that potential 
problems of heterogeneity are much less than if the diversity of the whole continent of Africa 
were represented in contemporary African-American populations. Unfortunately, the extent 
of the heterogeneity present in West and Central Africa remains largely unknown due to the 
lack of available information for the populations of this area. 

[0133] Since the extent of heterogeneity within European populations and within West 
and Central Africa remains largely unknown, potential effects of heterogeneity need to be 
addressed, particularly when considering an admixture mapping approach. There are two 
levels at which heterogeneity can affect an admixture mapping effort. First, heterogeneity 
can lead to erroneous estimates of the parental frequencies for the markers used in the map, 
thus biasing the estimate of admixture. Given that the goal of admixture mapping is to infer 
linkage conditioning on parental admixture, it is important to avoid misspecification of the 
ancestry-specific allele frequencies, because this could affect the final outcome of the 
analysis. Second, heterogeneity can affect the number of loci for the phenotype being 
studied. 

[0134] The effect of heterogeneity in biasing the estimates of the genetic contributions to 

an admixed population can be reduced by selecting markers showing homogeneity within the 
main parental populations (Europeans and Africans). In this way, the problem of 
contribution of different geographical areas to the parental populations is minimized, 
reducing the bias in admixture estimates. This strategy was implemented in previous 
admixture studies (Parra et al., supra, 1998, 2001; Pfaff et al., supra, 2001), wherein 
potentially informative markers in different European and African populations were 
systematically analyzed. As an example, to test for heterogeneity within Africa, 
each potentially informative marker was genotyped in samples from five African populations 
(two from Nigeria, two from Sierra Leone, and one from the Central African Republic), and the 
markers showing significant heterogeneity were excluded from the analysis (see below). All 
of these samples came from areas that were affected by the slave trade. If desired, a sample 
from Angola, which is a region that was the source of around 40% of enslaved Africans, can 
be incorporated, thus providing another sample of African parental populations. In addition 
to this strategy, it is important to note that there are statistical methods to test for 
misspecification of parental frequencies (see, e.g., McKeigue et al., supra, 2000). 

[0135] With respect to the potential problem of heterogeneity in the phenotypes being 
studied, it is expected that heterogeneity due to the presence of multiple genes (locus 
heterogeneity) affecting a phenotype will reduce the power of admixture mapping to detect 
significant genotypic effects, as it does with any other mapping method. Heterogeneity also 
can be due to multiple functional alleles within a particular gene (allelic heterogeneity). One 
example is MC1R, where approximately six relatively common variants lead to red hair, 
freckles, and pale skin among Native Europeans and their descendant populations. Within 
Europeans these variants are on different haplotype backgrounds, thus decreasing the power 
to detect an effect of the MC1R gene in association studies relative to the case where a single 
mutation had occurred and risen to high frequencies. However, in an admixed population 
(e.g., European/African), these variants will all be in allelic association with markers 
informative for ancestry (e.g., the MC1R marker, see Table 1) and, since they all have the 
effect of lightening the skin, their information will be compounded making the identification 
of MC1R by admixture mapping no different with six functional variants than were there 
only one functional variant unique to Europeans. So long as the effects of functional variants 
within a particular parental population are in the same direction (for example, in lowering the 
risk of disease), allelic heterogeneity will not be a serious problem in admixture mapping. 

[0136] The majority (80-90%) of genetic variation among human individuals is 
inter-individual; only 10-20% of the variation is due to population differences (e.g., Nei, supra, 
1987; Cavalli-Sforza et al., supra, 1994, Deka et al., supra, 1995). Most populations share 
alleles and those alleles that are most frequent in one population generally are also frequent in 
others. There are very few classical (blood group, serum protein, and immunological) or 
DNA genetic markers which are either population-specific or have large frequency 
differentials among geographically and ethnically defined populations (Roychodhury and 
Nei, supra, 1988; Cavalli-Sforza et al., supra, 1994). Despite this apparent lack of unique 
genetic markers, there are marked physical and physiological differences among human 
populations that presumably reflect long-term adaptation to unique ecological conditions, 
random genetic drift, and sex selection. In contemporary populations, these differences are 
evident both in morphological differences between ethnic groups and in differences in 
susceptibility and resistance to disease. 

[0137] The most useful unique alleles for admixture and mapping studies are those that 
also have large differences in allele frequency among populations (Reed, supra, 1973; 
Chakraborty et al., supra, 1992; Stephens et al., supra, 1994). The fact that they are totally 
absent from all other populations does simplify some of the statistical computations, and can 
facilitate more confident parental allele frequency estimates, but is not the primary reason for 
their utility. The designation population-specific alleles (PSAs) was initially used to describe 
genetic markers with large allele frequency differentials between populations (Shriver et al., 
supra, 1997; Parra et al., supra, 1998), but these markers are now referred to by the more 
correct and descriptive term, Ancestry Informative Markers (AIMs). For a biallelic marker, 
the frequency differential (δ) is equal to p_x - p_y, which is equal to q_y - q_x, where p_x and p_y are 
the frequencies of one allele in populations X and Y and q_x and q_y are the frequencies of the 
other. Median δ levels among major ethnic groups range between 15% and 20%, and the vast 
majority (> 95%) of arbitrarily identified biallelic genetic markers have δ < 50% (Dean et al., 
supra, 1994, which is incorporated herein by reference). Statistical estimates of power in an 
admixture mapping study based on using markers with an Fst > 0.4 were previously 
presented (McKeigue et al., supra, 2000). With 1,000 such markers evenly spaced across the 
genome, it was demonstrated that it was possible to have a statistical power of 80% to 
identify a disease gene that explains a 2 fold relative risk between the parental populations. 
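
The two informativeness measures used above, the frequency differential δ and Fst, can be computed for a biallelic marker as sketched below. This is only an illustration: the text does not state which Fst estimator was used, so the sketch assumes the simple heterozygosity-based definition Fst = (H_T - H_S) / H_T for two equally weighted populations. The function names and the example frequencies are hypothetical.

```python
def delta(p_x: float, p_y: float) -> float:
    """Magnitude of the allele frequency differential between populations X and Y."""
    return abs(p_x - p_y)


def fst_two_populations(p_x: float, p_y: float) -> float:
    """Fst = (H_T - H_S) / H_T for one biallelic marker and two equally
    weighted populations (an assumed, simplified estimator)."""
    p_bar = (p_x + p_y) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0  # marker is monomorphic in both populations
    h_within = (2.0 * p_x * (1.0 - p_x) + 2.0 * p_y * (1.0 - p_y)) / 2.0
    return (h_total - h_within) / h_total


# Made-up example frequencies of one allele in populations X and Y.
p_x, p_y = 0.9, 0.1
q_x, q_y = 1.0 - p_x, 1.0 - p_y

# The identity p_x - p_y = q_y - q_x holds for any biallelic marker.
assert abs((p_x - p_y) - (q_y - q_x)) < 1e-12

print(f"delta = {delta(p_x, p_y):.2f}")
print(f"Fst   = {fst_two_populations(p_x, p_y):.2f}")
```

Under this definition, a marker can have a large δ and a smaller Fst, so thresholds stated in terms of one measure are not interchangeable with thresholds stated in terms of the other.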

[0138] AIMs, and their use according to a method of the invention are demonstrated in 
Examples 1 to 6 (below). In addition, allele frequency data for markers informative for 
admixture mapping in African-American and Hispanic populations have been reported 
(Smith et al., supra, 2001; Collins-Schramm et al., supra, 2002). In order to apply the 
methods of the invention to an analysis of disease predisposition or drug responsiveness, 
estimation of admixture proportions and admixture dynamic is important. Controlling for 
genetic structure in admixed populations requires knowledge of the ancestral proportions and 
the genetic structure of these populations. Reliable estimates of admixture proportions can 
allow the informed identification of populations to consider. Since the admixture LD that is 
created during hybridization is dependent on the level of admixture, sampling should focus 
generally on those areas of the country where there has been more admixture. 

[0139] In addition to knowing the ancestral proportions, it is important to understand the 
level of population structure present in the populations under consideration. A homogeneous 
population is one in which there is no assortative mating, a panmictic population in which 
families are formed more or less by random combination and without regard to DNA 
genotypes. In most large cosmopolitan populations, homogeneity is expected and found. If, 
however, there exists stratification within the population such that individuals do not mate at 
random, the population will not be homogeneous. Admixture is one of the possible 
mechanisms introducing genetic structure in a population, and taking into account this 
genetic structure facilitates admixture mapping. 

[0140] The effect of genetic structure is considered at two levels. First, parental 
populations are evaluated to determine whether they show heterogeneity in the allele 
frequencies of the selected AIMs; heterogeneity can affect the estimate of admixture 
proportions, as discussed above. Several methods can detect the presence of genetic 
structure. These methods can be grouped in two main categories, termed genomic control 
(GC) methods (Devlin and Roeder, supra, 1999), and structured association (SA) methods 
(Pritchard and Donnelly, supra, 2001). Both methods require genotyping of a panel of 
unlinked markers to estimate and correct for the effect of genetic structure, which, as 
discussed above, may have been due to sampling effects, or due to real demographic strata in 
the sampled population. The SA method (Pritchard et al., supra, 2000; Pritchard and 
Donnelly, supra, 2001) was used to test for genetic structure in the parental populations. This 
method is based on using the genotypic information provided by the unlinked markers to 
infer population structure, and has been implemented in a software program available from 
Jonathan Pritchard. In addition to testing for the presence of structure, the program estimates 
individual ancestry proportions, and, for the present studies, this Bayesian method was used 
to complement the Maximum Likelihood Estimate method. These two methods produce 
estimates of individual ancestry that are highly correlated. 
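
A maximum likelihood estimate of individual ancestry of the kind referred to above can be sketched as follows. The specific algorithm used in the present studies is not described in this passage, so the sketch makes its own assumptions: two parental populations, unlinked biallelic markers, genotypes coded as the number of copies of a reference allele, both gene copies drawn with the same admixture proportion m, and a simple grid search over m. All function and parameter names are hypothetical.

```python
import math


def log_likelihood(m, genotypes, freqs_a, freqs_b):
    """Log-likelihood of genotypes given admixture proportion m from population A.

    The expected allele frequency at each marker is m * p_A + (1 - m) * p_B,
    and each genotype (0, 1 or 2 reference alleles) is treated as binomial.
    """
    total = 0.0
    for g, p_a, p_b in zip(genotypes, freqs_a, freqs_b):
        f = m * p_a + (1.0 - m) * p_b
        f = min(max(f, 1e-9), 1.0 - 1e-9)  # guard against log(0)
        total += (math.log(math.comb(2, g))
                  + g * math.log(f)
                  + (2 - g) * math.log(1.0 - f))
    return total


def mle_admixture(genotypes, freqs_a, freqs_b, steps=1000):
    """Grid-search estimate of m in [0, 1] that maximizes the log-likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda m: log_likelihood(m, genotypes, freqs_a, freqs_b))


# Made-up data: four markers with parental frequencies 0.9 (A) and 0.1 (B).
genotypes = [2, 2, 1, 2]
freqs_a = [0.9, 0.9, 0.9, 0.9]
freqs_b = [0.1, 0.1, 0.1, 0.1]
print(f"estimated proportion from population A: "
      f"{mle_admixture(genotypes, freqs_a, freqs_b):.3f}")
```

A grid search is used here only because it is easy to verify; a numerical optimizer or an EM procedure would serve the same purpose. The estimate is only as good as the parental frequencies supplied, which is why misspecification of those frequencies is a concern.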

[0141] The second source of genetic structure in admixed populations is due to the 
admixture process itself, in which newly created linkage disequilibrium is introduced in the 
admixed population. AIMs, such as those exemplified herein, are particularly sensitive 
indicators of population structure that is related to ancestral proportions. To evaluate the 
presence of population structure, samples are tested for the non-random association of alleles 
both within a locus (Hardy-Weinberg disequilibrium) and among loci (gametic 
disequilibrium), and the distribution of individual ancestry estimates also is examined (see, 
Pfaff et al., supra, 2001; Parra et al., supra, 2001). 

[0142] The history of African Americans can be traced back to 1619, when the first 
Africans arrived at the British colonies (Jamestown), although as early as 1526 the presence 
of African slaves was reported in Spanish expeditions to what would become the United 
States (South Carolina, Georgia, Florida and New Mexico). Institutional slavery began very 
soon after, but it was not until the beginning of the 18th century that the importation of slaves 
reached increased rates, in parallel with the demand for workers to cultivate the tobacco, 
indigo, and rice plantations in the southern colonies; peaks occurred in the decade from 
1790-1800 and the first years of the 19th century. In 1808, the slave trade became illegal but 
continued at a low rate for several more decades. Different estimates have been offered on 
the total number of slaves brought into the United States with generally accepted numbers 
ranging between 380,000 and 570,000. 

[0143] Although it is difficult to precisely determine the ethnic origins of the African 
slaves, information from shipping lists has provided an approximate picture of their 
geographic provenance. The slave trade affected a very wide area of Western and Western- 
Central Africa, mainly the coastline between the present day countries of Senegal in the 
North and Angola in the South. The most important regions were Senegambia (Gambia and 
Senegal), Sierra Leone (Guinea and Sierra Leone), Windward Coast (Ivory Coast and 
Liberia), Gold Coast (Ghana), Bight of Benin (From the Volta river to the Benin river), 

Bight of Biafra (East of Benin river to Gabon), and Angola (Southwest Africa, including part 
of Gabon, Congo and Angola). Curtin (supra, 1969) has offered, based on data on the 
English trade of the 18th century (the peak of the Atlantic slave trade), estimates of the 
proportional contribution by areas, showing that Angola and Bight of Biafra were the regions 
contributing the highest numbers of slaves imported into the North American mainland 
(around 25% each). However, there were significant differences in ethnic origin depending 
on the port of entry in the United States, and the figures for the Colonies of Virginia and 
South Carolina differed considerably. 

[0144] The history of African Americans has been marked not only by the forced 

migration from Africa, but also by the admixture with the other ethnic groups they met when 
they arrived in North America, including with Europeans and Native Americans. However, 
few historical records address the issue of admixture. Additionally, there have been 
important factors that, in the time since the abolition of slavery until the present, have 
configured the present African-American population. Of special interest is the pattern of 
migration of African Americans within the United States over the past 150 years. In this 
sense, the redistribution of African Americans in the Southern States during the 19th century, 
and the Great Migration from the rural South to the urban areas in the North beginning after 
World War I are of particular relevance, and have had an enormous impact in defining the 
present distribution of the African-American population in the US (Johnson and Campbell, In 
Black Migration in America: A Social Demographic History, Duke University Press, 
Durham NC 1981). 

[0145] With respect to Hispanics, the term "Hispanic" was coined mainly for 
governmental demographic purposes, and is generally employed to identify persons of Latin 
American origin or descent, living in the United States. Although this definition lumps 
together people with very different historical, cultural and linguistic backgrounds, this 
classification has been widely used. Even though Central America, the Caribbean, and South 
America have been for centuries under the domination of the Iberian imperial powers (Spain 
and Portugal), they have had quite different regional histories, both before and after the 
Colonial period. Populations from four continents, North and South America, Europe, and 
Africa, have contributed to the formation of contemporary Hispanic populations. The 
anthropological background of the main three Hispanic groups currently living in the United 
States - Mexican Americans, Puerto Ricans and Cuban Americans, which together make up 
more than 80% of the total US Hispanic population - is considered here. 

[0146] Mexican Americans show the highest Amerindian contribution of the three 
aforementioned groups. Soon after the Spanish conquest of Mexico, at the beginning of the 
16th century, intermixture of the Spanish men with Amerindian women resulted in an 
increasingly important mixed population (Mestizos), and this racial mixing continued through 
the three centuries of Spanish domination in "New Spain", configuring both biologically and 
culturally the Mexican population. The majority of estimates have indicated an Amerindian 
component in Mexican Americans ranging between 30% and 40% (Hanis et al., supra, 1986; 
Long et al., 1991; Hanis et al., Diabetes Care 14:618-627, 1991; Merriwether et al., Amer. J. 
Phys. Anthrop. 102:153-159, 1997). It is interesting to point out, as well, that some studies 
have shown differences in the amount of Amerindian ancestry depending on socioeconomic 
status (Chakraborty et al., Genet. Epidemiol. 3:435-454, 1986; Mitchell et al., Ethnicity and 
Disease 3:22-31, 1992). There was also a substantial African presence in the Mexico 
territory during the Spanish rule. Curtin (supra, 1969) has estimated the total number of 
Africans imported into Mexico during the entire period of the slave trade to be around 200,000. 
Their contribution to the Mexican gene pool, however, has been estimated to be much lower 
than the European and Amerindian contribution, ranging from zero to 10% (see, e.g., Hanis et 
al., supra, 1991). 

[0147] In the Caribbean Colonies (Cuba and Puerto Rico), the situation was very different 
from the Mainland. The Native American population was far smaller, and was decimated by 
slavery and disease very soon after the first contact with the Europeans. Nevertheless, the 
rate of admixture during the initial phases of the colonization was high enough to result in an 
appreciable genetic contribution (about 18%) from the Arawaks and Caribs, the original 
inhabitants of the Hispanic Caribbean (Hanis et al., supra, 1991). Another distinctive feature 
of this region is a significant African influence, which is also reflected in many aspects of the 
present societies of countries like Cuba, Puerto Rico, and the Dominican Republic. African 
slaves were imported to work in the sugar plantations in large numbers, even outnumbering 
the population of European origin (Kanellos and Perez, In Chronology of Hispanic-American 
history: from pre-Columbian times to the present ; New York, Gale Research 1995). 
Accordingly, the percentage of African genetic contribution in contemporary Cubans (20%) 
and Puerto Ricans (37%) is significantly higher than in other Hispanic populations (Hanis et 
al., supra, 1991). 

[0148] It is clear that race is a complex concept and, in general usage, reflects both a 
cultural and biological feature of a person or group of people. Given the fact that physical 
differences between populations are often accompanied by cultural differences, it has been 
difficult to separate these two elements. There has been a movement in several fields of 
science to oversimplify the issue declaring that race is merely a social construct. While this 
often can be true, depending on what aspect of variation between people is being considered, 
it can be false for many particular instances of differences between the populations of the 
world. One clear example of a biological difference is skin color. Culture or environment 
has almost no effect on the level of pigmentation in a person's skin. Yet there are dramatic 
differences across populations. Pigmentation is, of course, only skin deep and is quite simple 
in light of the complex environments in which we live and how these affect individual and 
group quality of life. 

[0149] The human species is relatively young and, as a species, most likely originated in 
east Africa 100,000 years ago, and diverged as groups to settle the globe (Cavalli-Sforza and 
Cavalli-Sforza, In The Great Human Diasporas. The History of Diversity and Evolution 
(Perseus Books, Cambridge MA) 1995). During these migrations, and in the time since, there 
has been some degree of independent evolution of the populations that settled the various 
continents of the world. The simplest evidence of this evolution is seen in the differences in 
allele frequencies at genetic markers. Generally, alleles that are found in one population are 
also found in all populations, and the alleles that are the most common in one population also 
are common in others. These similarities between populations highlight the recent common 
origin of all populations. However, there are examples of genetic markers that are different 
between populations and, as disclosed herein, these markers, AIMs, can be used to estimate 
the ancestral origins of a person or population. 

[0150] The present invention provides methods of estimating proportional ancestry of at 
least two ancestral groups of a test individual and, in particular, provides a confidence level 
with respect to the proportional ancestry. A method of the invention can be performed by 
contacting a sample, which includes nucleic acid molecules of the test individual, with 
hybridizing oligonucleotides that can detect nucleotide occurrences of SNPs of a panel of at 
least about ten AIMs that are indicative of BGA for each ancestral group examined, wherein 
the contacting is under conditions suitable for detecting the nucleotide occurrences of the 
AIMs of the test individual by the hybridizing oligonucleotides; and identifying, with a 
predetermined level of confidence, a population structure that correlates with the nucleotide 
occurrences of the AIMs of each of the ancestral groups examined, wherein the population 
structure is indicative of proportional ancestry. 

[0151] The term "biogeographical ancestry" or "BGA" is used herein to describe the 
biological or genetic component of race. BGA is a simple and objective description of the 
ancestral origins of a person, in terms of the major population groups (e.g., Native American, 
East Asian, Indo-European, and sub-Saharan African). BGA estimates can represent the 
mixed nature of many people and populations today. In many countries, including the United 
States, there has been extensive mixing among populations that initially had been separate. 
The term "admixture" is used herein to refer to such population mixing. In this respect, BGA 
estimates can be understood as individual admixture proportions, which take the form of a 
series of percentages that add to 100%. For example, a person can have 75% Indo-European, 
15% African, and 10% Native American ancestry, or can have 100% Indo-European 
ancestry, or the like. 

[0152] The proportional ancestry estimated according to a method of the invention can be 
a proportion of any ancestral group, including, for example, a proportion of sub-Saharan 
African, Native American, IndoEuropean, East Asian, Middle Eastern, or Pacific Islander 
ancestral group, and generally is a combination of two or more of such ancestral groups. 
Thus, the proportional ancestry of a test individual can include proportional affiliation among 
the sub-Saharan African and IndoEuropean ancestral groups (e.g., 80% sub-Saharan African 
and 20% IndoEuropean; or 60% sub-Saharan African, 20% IndoEuropean, and 20% of a third 
ancestral group); or can include proportional affiliation among the Native American and 
IndoEuropean ancestral groups; East Asian and Native American ancestral groups; 
IndoEuropean and East Asian ancestral groups; and the like. 

[0153] A panel of AIMs useful for estimating proportional ancestry of an individual can 
include AIMs as set forth in SEQ ID NOS:1 to 331, for example, AIMs as set forth in SEQ 
ID NOS:1 to 71, which can be useful for determining proportional ancestries including 
IndoEuropean, sub-Saharan African, East Asian, and Native American; or AIMs as set forth 
in SEQ ID NOS:7, 21, 23, 27, 45, 54, 59, 63, and 72 to 152, which can be useful for 
determining proportional ancestry of East Asians and sub-Saharan Africans; or in SEQ ID 
NOS:3, 8, 9, 11, 12, 33, 40, 59, 63, and 153 to 239, which can be useful for determining 
proportional ancestry of East Asians and IndoEuropeans; or in SEQ ID NOS:1, 8, 11, 21, 24, 
40, 172, and 240 to 331, which can be useful for determining proportional ancestry of 
IndoEuropeans and sub-Saharan Africans. 

[0154] An estimate can be made, for example, of an individual's proportional ancestry 
with respect to three ancestral groups. In this method, identifying a population structure 
within an individual that correlates with the nucleotide occurrences of the AIMs of the test 
individual can be practiced by performing a likelihood determination for affiliation with each 
of a sub-Saharan African ancestral group, a Native American ancestral group, an 
IndoEuropean ancestral group, and an East Asian ancestral group; thereafter selecting three 
ancestral groups having a greatest likelihood value for the individual; determining a 
likelihood of all possible proportional affiliations among the three ancestral groups having the 
greatest likelihood value, whereby a population structure or proportional affiliation that 
correlates with the nucleotide occurrences of the AIMs of the test individual is identified; and 
identifying a single proportional combination of maximum likelihood. Alternatively, 
identifying a population structure that correlates with the nucleotide occurrences of the AIMs 
can be practiced by performing six two-way (binary) comparisons comprising likelihood 
determinations for affiliation of each group compared to each other group; thereafter 
selecting three ancestral groups having a greatest likelihood value across all comparisons; 
determining a likelihood of all possible proportional affiliations among the three ancestral 
groups having the greatest likelihood value, whereby a population structure or proportional 
affiliation that correlates with the nucleotide occurrences of the AIMs of the test individual is 
identified; and identifying a single proportional combination of maximum likelihood. Such a 
methodology works as well for individuals of three-way admixture as individuals that are 
100% affiliated with a single group. 
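
By way of illustration only, the first procedure described above can be sketched in Python as follows. The sketch assumes a likelihood model that is not set out in this paragraph: at each AIM, the chance of drawing the counted allele is taken to be the admixture-weighted average of the group allele frequencies, and the genotype (0, 1, or 2 copies) is treated as binomial. The group labels, function names, frequencies, and 1% grid step are hypothetical.

```python
import math

# Hypothetical group order for each per-locus frequency tuple.
GROUPS = ["AFR", "NAM", "EUR", "EAS"]

def log_likelihood(genotypes, freqs, mix):
    """Log-likelihood of the genotypes given admixture proportions `mix`.

    genotypes: copies (0, 1 or 2) of the counted allele at each AIM.
    freqs:     one tuple per AIM with that allele's frequency in each group.
    Assumed model: binomial genotype with allele frequency sum(m_k * p_k).
    """
    total = 0.0
    for g, p in zip(genotypes, freqs):
        q = sum(m * pk for m, pk in zip(mix, p))
        q = min(max(q, 1e-6), 1.0 - 1e-6)  # keep log() finite
        total += math.log(math.comb(2, g)) + g * math.log(q) + (2 - g) * math.log(1.0 - q)
    return total

def top_three_groups(genotypes, freqs):
    """Score 100% affiliation with each group and keep the three best."""
    k = len(freqs[0])
    scores = []
    for i in range(k):
        one_hot = [1.0 if j == i else 0.0 for j in range(k)]
        scores.append((log_likelihood(genotypes, freqs, one_hot), i))
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:3])

def three_way_mle(genotypes, freqs, step=1):
    """Search every proportional combination of the three retained groups."""
    k = len(freqs[0])
    a_idx, b_idx, c_idx = top_three_groups(genotypes, freqs)
    best_ll, best_mix = -math.inf, None
    for a in range(0, 101, step):
        for b in range(0, 101 - a, step):
            c = 100 - a - b
            mix = [0.0] * k
            mix[a_idx], mix[b_idx], mix[c_idx] = a / 100, b / 100, c / 100
            ll = log_likelihood(genotypes, freqs, mix)
            if ll > best_ll:
                best_ll, best_mix = ll, mix
    return best_mix, best_ll

# Hypothetical frequencies: each AIM is common (0.9) in exactly one group.
freqs = [(0.9, 0.1, 0.1, 0.1), (0.1, 0.9, 0.1, 0.1),
         (0.1, 0.1, 0.9, 0.1), (0.1, 0.1, 0.1, 0.9)]
mix, _ = three_way_mle([1, 1, 0, 0], freqs)
print(dict(zip(GROUPS, mix)))
```

The same search handles an individual who is 100% affiliated with one group, because the vertices of the grid (100%, 0%, 0%) are among the combinations examined.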

[0155] An estimate of an individual's proportional ancestry that includes proportions of 
three ancestral groups also can be made by performing three three-way comparisons among 
the groups; determining a likelihood of all possible proportional affiliations among the three 
ancestral groups having the greatest likelihood value, whereby a population structure or 
proportional affiliation that correlates with the nucleotide occurrences of the AIMs of the test 
individual is identified; and identifying a single proportional combination of maximum 
likelihood. An advantage of the present methods is that a graphical representation of the 
comparison of the three ancestral groups can be generated, wherein the graphical 
representation comprises a triangle with each ancestral group independently represented by a 
vertex of the triangle, and wherein the maximum likelihood value of proportional affiliation 
for an individual comprises a point within the triangle (see Figures 2 and 3). If desired, the 
graphical representation can further include a confidence contour that indicates a level of 
confidence associated with estimating the proportional ancestry. 

[0156] An estimate of an individual's proportional ancestry also can be made where the 
proportional ancestry includes proportions of four ancestral groups. In various aspects of this 
method, identifying a population structure that correlates with the nucleotide occurrences of 
the AIMs of the test individual is practiced by performing six two-way comparisons, or by 
performing three three-way comparisons, or by performing one four-way comparison among 
the groups; determining a likelihood of all possible proportional affiliations among the four 
ancestral groups having the greatest likelihood value, whereby a population structure or 
proportional affiliation that correlates with the nucleotide occurrences of the AIMs of the test 
individual is identified; and identifying a single proportional combination of maximum 
likelihood. If desired, the method can further include generating a graphical representation of 
the comparison of the four ancestral groups, wherein the graphical representation comprises 
a pyramid with each ancestral group independently represented by a vertex of the pyramid, 
and wherein the maximum likelihood value of proportional affiliation for an individual 
comprises a point within the pyramid. If desired, the graphical representation can further 
include a confidence contour comprising a sphere around the point, wherein the sphere 
indicates a level of confidence associated with estimating the proportional ancestry. 

[0157] As disclosed herein, such methods are useful, for example, as a forensic tool. The 
present methods provide substantially greater information for forensics because, using a DNA 
sample obtained at a crime scene, the methods can provide an investigator with prospective 
information as to the likelihood of an individual's ancestry, as well as hair, skin and eye 
pigmentation. In comparison, present DNA methods only provide retrospective 
information because they require that a DNA sample from a crime scene be compared with 
DNA samples contained in a database or taken from specific individuals. Thus, while the 
latter methods can provide confirmation that a suspect is likely the perpetrator of a crime, 
they provide no useful information until the suspect is apprehended, except in cases where 
the suspect's DNA sample already has been entered into a database. 

[0158] The methods of estimating proportional ancestry of a test individual as disclosed 
herein also provide a tool that can supplement genealogical information, which generally is 
based on relationships established using geopolitical information (see Example 3). For 
example, the present methods provide information that can be used to generate an ancestral 
map of the world, wherein locations of populations having a proportional ancestry 
corresponding to the proportional ancestry of the test individual are indicated on the ancestral 
map. As such, the method can further include overlaying the ancestral map with a 
genealogical map, wherein the genealogical map indicates locations of populations having 
geopolitical relevance with respect to the test individual, and statistically combining the 
information of the ancestral map and genealogical map to obtain a most likely estimate of 
family history of the test individual. 

[0159] Identifying a population structure that correlates with the nucleotide occurrences of 
the AIMs, according to a method of the invention, can be performed by comparing the 
nucleotide occurrences of the AIMs of the test individual with known proportional ancestries 
corresponding to nucleotide occurrences of AIMs indicative of BGA. The known 
proportional ancestries corresponding to nucleotide occurrences of AIMs indicative of BGA 
can be contained in a table or other list, and the nucleotide occurrences of the test individual 
can be compared to the table or list visually, or can be contained in a database, and the 
comparison can be made electronically, for example, using a computer. A particularly useful 
application of a method of the invention involves associating known proportional ancestries 
corresponding to nucleotide occurrences of AIMs indicative of BGA of individuals, with a 
photograph of a person from whom the known proportional ancestry was determined, thus 
providing a means to further infer physical characteristics of a test individual. In one aspect, 
the photograph is a digital photograph, which comprises digital information that can be 
contained in a database that can further contain a plurality of such digital information of 
digital photographs, each of which is associated with a known proportional ancestry 
corresponding to nucleotide occurrences of AIMs indicative of BGA of the person in the 
photographs. 

[0160] A method of the invention can further include identifying a photograph of a person 
having a proportional ancestry corresponding to the proportional ancestry of the test 
individual. Such identifying can be done by manually looking through one or more files of 
photographs, wherein the photographs are organized, for example, according to the 
nucleotide occurrences of AIMs of the person in the photograph. Identifying the photograph 
also can be performed by scanning a database comprising a plurality of files, each file 
containing digital information corresponding to a digital photograph of a person having a 
known proportional ancestry, and identifying at least one photograph of a person having 
nucleotide occurrences of AIMs indicative of BGA that correspond to the nucleotide 
occurrences of AIMs indicative of BGA of the test individual. 

[0161] According to the present invention, BGA can be determined using any of several 
variations of the disclosed BGA test, including three BGA tests referred to as the 
ANCESTRYbyDNA™ 1.0 test, the ANCESTRYbyDNA™ 2.0 test, and the 
ANCESTRYbyDNA™ 3.0 test (DNAPrint genomics, Inc.; Sarasota FL), which utilize 
selected panels of Ancestry Informative Markers (AIMs) that have been characterized in a 
large number of well-defined population samples. The AIMs are selected on the basis of a 
showing of substantial differences in frequency between population groups and, as such, 
provide information as to the origin of a particular person whose ancestry is otherwise 
unknown. For example, the Duffy Null allele (FY*0) is very common (approaching fixation 
or an allele frequency of 100%) in all sub-Saharan African populations, but is not found 
outside of Africa. Thus, a person with this allele is very likely to have some level of African 
ancestry. Upon analysis of AIMs in a DNA sample from a person of unknown origin, a 
likelihood (or probability) can be determined that the person is derived from particular 
parental populations by calculating all of the possible mixes of parental populations. The 
population (or combination of populations) where the likelihood is the highest is taken as the 
best estimate of the ancestral proportions of the person; confidence intervals on these point 
estimates of ancestral proportions are also calculated. 

[0162] An objective assessment of the biological component of human ancestry provides 
important knowledge about the person whose DNA is examined. For example, an analysis of 
the biological component of ancestry can elucidate health disparities by identifying, for 
example, genetic contributions to the higher rates of hypertension and diabetes in African 
Americans, or the higher rates of dementia in European Americans. Estimates of BGA also 
can help connect individuals separated by adoption or some other event with their ancestral 
populations. And even if a person is not particularly motivated to reconnect with ancestors, 
he or she can uncover the past of their family, for example, to verify family legends or 
identify forgotten roots. Because the disclosed method is based on an analysis of DNA, it 
provides a personal demographics tool, which, unlike a census, can provide highly accurate 
demographics data. 

[0163] There are several commercially available tests that analyze mitochondrial DNA 
(mtDNA) or Y chromosome markers, and have been promoted as a means of learning one’s 
ancestral origins. Although these tests can provide information regarding the provenance of 
some of a person's ancestors, the tests are very limited. For example, one generation ago a 
person has two ancestors, one mother and one father; five generations ago, a person has 32 
ancestors; while 10 generations ago, a person has 1024 ancestors. Ten generations is roughly 
250 years and well within the time frame of genealogical interest, especially when 
considering, for example, the settlement of North America. Because the mtDNA and Y 
chromosome tests only look at a small portion of the genome (the matrilineal and patrilineal 
lineages, respectively), they can only provide information relating to a very small proportion 
of a person's ancestors. The BGA test of the invention utilizes sequences throughout a 
person's genome and, therefore, can provide information about a greater number of ancestors. 
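
The arithmetic behind this comparison can be checked with a short Python sketch. It counts pedigree positions only, so it assumes no ancestor appears twice in the tree, and it assumes a male test individual, for whom the mtDNA test follows one strictly maternal line and the Y chromosome test follows one strictly paternal line. The function names are hypothetical.

```python
def ancestors_at(generation):
    """Number of ancestor positions n generations back (2, 4, 8, ...)."""
    return 2 ** generation

def uniparental_fraction(generation):
    """Share of those positions on the strict maternal or strict paternal line.

    One position per generation is reached by mtDNA and one by the Y chromosome.
    """
    return 2 / ancestors_at(generation)

for n in (1, 5, 10):
    print(n, ancestors_at(n), f"{uniparental_fraction(n):.2%}")
```

At ten generations the two uniparental lines reach 2 of 1024 positions, or about 0.2% of them, whereas markers spread across the autosomes can in principle carry contributions from any of the positions.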

[0164] Accordingly, the present invention provides a method of estimating, with a 
predetermined level of confidence, proportional ancestry of at least two ancestral groups of a 
test individual. Such a method, referred to as a "biogeographical ancestry test" or "BGA 
test", can be performed, for example, by contacting a sample, which includes nucleic acid 
molecules of the test individual, with hybridizing oligonucleotides that can detect nucleotide 
occurrences of SNPs of a panel of at least about ten AIMs that are indicative of BGA for each 
ancestral group examined, wherein the contacting is under conditions suitable for detecting 
the nucleotide occurrences of the AIMs of the test individual by the hybridizing 
oligonucleotides; and identifying, with a predetermined level of confidence, a population 
structure that correlates with the nucleotide occurrences of the AIMs of each of the ancestral 
groups examined, wherein the population structure is indicative of proportional ancestry. 

[0165] As used herein, the term "proportional ancestry" refers to the percent contribution 
of each (if more than one) ancestral group to which an individual belongs. The proportional 
ancestry estimated according to a method of the invention can be a proportion of any 
ancestral group, including, for example, a proportion of sub-Saharan African, Native 
American, IndoEuropean, East Asian, Middle Eastern, or Pacific Islander ancestral group, 
and generally is a combination of two or more of such ancestral groups. Thus, the 
proportional ancestry of a test individual can include proportions of sub-Saharan African and 
IndoEuropean ancestral groups (e.g., 80% sub-Saharan African and 20% IndoEuropean; or 
60% sub-Saharan African, 20% IndoEuropean, and 20% of a third ancestral group); or can 
include proportions of Native American and IndoEuropean ancestral groups; East Asian and 
Native American ancestral groups; IndoEuropean and East Asian ancestral groups; and the 
like. Similarly, the proportional ancestry can include proportions of Native American, East 
Asian, and IndoEuropean ancestral groups; sub-Saharan African, Native American, and 
IndoEuropean ancestral groups; sub-Saharan African, Native American, and East Asian 
ancestral groups; and the like. 

[0166] A panel of AIMs useful for estimating proportional ancestry of an individual can 
include AIMs as set forth in SEQ ID NOS:1 to 331, for example, AIMs as set forth in SEQ 
ID NOS:1 to 71, which can be useful for determining proportional ancestries including 
IndoEuropean, sub-Saharan African, East Asian, and Native American. For example, the 
AIMs as set forth in SEQ ID NOS:7, 21, 23, 27, 45, 54, 59, 63, and 72 to 152 can be useful 
for determining proportional ancestry of East Asians and sub-Saharan Africans; the AIMs as 
set forth in SEQ ID NOS:3, 8, 9, 11, 12, 33, 40, 59, 63, and 153 to 239 can be useful for 
determining proportional ancestry of East Asians and IndoEuropeans; and the AIMs as set 
forth in SEQ ID NOS:1, 8, 11, 21, 24, 40, 172, and 240 to 331 can be useful for determining 
proportional ancestry of IndoEuropeans and sub-Saharan Africans. 

[0167] The ANCESTRYbyDNA™ 1.0 test (DNAPrint genomics, Inc.) is a first version of 
the BGA test that was specifically designed to provide information on the proportions of 
ancestry at the continental level. As such, the ANCESTRYbyDNA™ 1.0 test allowed 
information to be obtained as to levels of Native American, European, and African ancestry, 
as three component groups. The ANCESTRYbyDNA™ 2.0 test, in comparison, provides 
information on the proportions of ancestry at the continental level for most continents, 
including Native American, Indo-European (includes European, Middle Eastern and South 
Asian groups such as Indians), African, and East Asian (which includes Pacific Islanders), and 
can distinguish ancestries within Asia and the Pacific Rim. The ANCESTRYbyDNA™ 3.0 
test can further define the levels of ancestry within continents, for example, by distinguishing 
Japanese from Chinese, or Northern European from Middle Eastern, thus providing greater 
insight into where within a particular continent a person's ancestors were derived. 

[0168] For the ANCESTRYbyDNA™ 2.0 test, a logical grouping into four BGA 
delineations was made, wherein South Asian, Middle Eastern and European are grouped into 
a single group called IndoEuropean (see Example 2). This grouping was based on 
anthropological evidence and cultural connections between these groups (e.g., their languages 
are derived from a common base). The results disclosed herein demonstrate that these groups 
are far more similar to one another in genetic sequence content than to other groups. The 
ANCESTRYbyDNA™ 2.0 test also performs more accurately when Pacific Islanders are 
grouped with East Asians. As such, the four groupings used in the ANCESTRYbyDNA™ 
2.0 test include 1) Native American (i.e., those who migrated to inhabit South and North 
America); 2) IndoEuropean (Europeans, Middle Easterners and South Asians such as 
Indians); 3) East Asians (Japanese, Chinese, Koreans, Pacific Islanders); and 4) Africans 
(sub-Saharan). The ANCESTRYbyDNA™ 3.0 test can further distinguish between South 
Asian and European, and between Pacific Islander and East Asian, thus providing 
6 proportions (Native American, European, African, South Asian, East Asian and Pacific 
Islander), although the confidence intervals are larger than those obtained with the 
ANCESTRYbyDNA™ 2.0 test. Further improvements to the tests are provided, wherein the 
confidence intervals are reduced. Confidence intervals around a point estimate can be 
reduced, thus increasing the accuracy of the test, by analyzing a complementary panel, 
thereby improving the confidence intervals by about 50%. 

[0169] The algorithm used to determine the ancestral proportions was developed based on 
the idea that it is possible to use certain statistical methods to make an inference of the 
proportionality of ancestry in an individual sample based on their sequence (see Example 6; 
see, also, Table 12). The method of making this inference using the present algorithm is 
similar to those of others, wherein, if the frequency of an allele in a population is known, and 
this frequency is significantly different from population to population, a "Maximum 
Likelihood Estimation" (MLE) can be used to determine the probability that a person with the 
allele belongs to one of the groups. Expanded to include multiple alleles from multiple 
genetic loci and multiple populations, the process is the same. By way of simplification, 
Bayes' theorem states that the probability of an event given a circumstance (called a posterior 
probability) is a function of the frequency of the circumstance given the event (a conditional 
probability) and the frequency of the event itself (the prior probability). By determining the 
probability of the event given the circumstance for a wide range of possible events, that with 
the highest probability can be selected, thus obtaining the MLE for the probability. 
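
As a minimal worked example of the statement of Bayes' theorem above, the following Python sketch computes the posterior probability of group membership given that a single allele is observed. The allele frequencies and the equal priors are hypothetical values chosen for illustration, and the function name is not taken from the disclosed software.

```python
def posterior(allele_freqs, priors):
    """Posterior probability of each group given that the allele is observed.

    allele_freqs: P(allele | group), the conditional probability.
    priors:       P(group), the prior probability.
    """
    joint = {group: allele_freqs[group] * priors[group] for group in allele_freqs}
    total = sum(joint.values())
    return {group: value / total for group, value in joint.items()}

# Hypothetical allele that is common in group A and rare in group B.
post = posterior({"A": 0.8, "B": 0.2}, {"A": 0.5, "B": 0.5})
best_group = max(post, key=post.get)
print(post, best_group)
```

With equal priors the posterior is proportional to the allele frequency, so the group in which the allele is more frequent is selected. For several independent loci the conditional probabilities would be multiplied before normalizing.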

[0170] In the present algorithm, the event is a proportionality of ancestry, and the 
circumstance is the genotype of the individual. If the minor allele frequency for 10 SNPs in 
2 populations of human beings is known, and the sequence of a person at each of the 10 SNPs 
is known, a simple binary classification into one of the two groups can be made by choosing 
that for which the conditional probability is highest. This would offer little improvement 
over current methods for determining the BGA from a DNA sample. What is provided by the 
present invention is the ability to obtain the proportionality of ancestry for more complex and 
realistic scenarios of ancestry. There are many possible combinations such as 99% African, 
1% European, 0% Native American, 0% East Asian; or 98% African, 1% European, 1% Native 
American, 0% East Asian; and the like. The posterior probabilities for each of the thousands 
of possibilities are not the same for any particular individual, given his or her multilocus 
genotype (i.e., genotypes of many AIMs), and, in fact, there is one that has the highest 
posterior probability or likelihood for each genotype. It is this combination that the present 
algorithm selects (i.e., the MLE). 
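
The set of candidate combinations can be enumerated as in the following Python sketch, which assumes that proportions are examined in whole-percent steps (the step size actually used by the present algorithm is not stated here). The function names are hypothetical.

```python
from math import comb

def admixture_grid(n_groups, total=100, step=1):
    """Yield every tuple of percentages for n_groups that sums to `total`.

    Assumes `step` divides `total`, so that every entry is a multiple of step.
    """
    if n_groups == 1:
        yield (total,)
        return
    for first in range(0, total + 1, step):
        for rest in admixture_grid(n_groups - 1, total - first, step):
            yield (first,) + rest

def grid_size(n_groups, step=1):
    """Closed-form count of grid points (stars and bars)."""
    units = 100 // step
    return comb(units + n_groups - 1, n_groups - 1)

grid = list(admixture_grid(4))
print(len(grid), grid_size(4))
```

Under this assumption four groups on a 1% grid give 176,851 candidate combinations, including (99, 1, 0, 0) and (98, 1, 1, 0) from the examples above; a coarser step reduces the count sharply (1,771 combinations at a 5% step). Each candidate would then be scored and the one with the highest likelihood retained.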

[0171] Previous methods have been limited in that the confidence of the estimate was not 
known. The present algorithm addresses this limitation by plotting the MLE graphically, 
including plotting the confidence regions around the MLE such that a level of confidence can 
be ascertained (see Figures 2 and 3). Further, the algorithm (i.e., the software code) that 
performs the MLE calculation operates in an unusually efficient manner. The triangle plot 
provided by the algorithm is an original method to graphically represent the MLE 
calculations and their confidence intervals. To read a triangle plot (see below), a 
perpendicular line is dropped from each vertex (triangle point) of the triangle to the opposite 
edge (base) of the triangle (see Figure 2A). In this figure, the circle represents the MLE, and 
a line has been dropped from the Native American (NAM) vertex to the line below; the line 
serves as a scale for the percentage of Native American ancestry, from 0% at the base to 
100% at the vertex (or tip). Projecting the circle on this line can be analogized to holding a 
flashlight to the right of the triangle at the same level as the circle and observing the shadow 
the circle makes on the line. Where this shadow falls on the line indicates the percentage of 
Native American ancestry. In this example, the individual is about 15% Native American, as 
indicated by the hash mark on the line. 
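
The reading rule just described is equivalent to treating the three proportions as barycentric coordinates of the plotted point. The Python sketch below places the Native American (NAM) vertex at the top of an equilateral triangle; the positions assigned to the other two vertices, the group labels AFR and EUR, and the example proportions are assumptions made for illustration and are not taken from Figure 2A.

```python
import math

HEIGHT = math.sqrt(3) / 2  # height of an equilateral triangle with unit sides

# Hypothetical vertex layout: NAM at the tip, the other two groups on the base.
VERTICES = {"NAM": (0.5, HEIGHT), "AFR": (0.0, 0.0), "EUR": (1.0, 0.0)}

def to_xy(mix):
    """Plot position of a set of proportions (fractions that sum to 1)."""
    x = sum(mix[g] * VERTICES[g][0] for g in VERTICES)
    y = sum(mix[g] * VERTICES[g][1] for g in VERTICES)
    return x, y

def _signed_area(p, q, r):
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))

def read_percentages(x, y):
    """Recover the percentages from a plotted point.

    Each value is the point's distance from the edge opposite a vertex,
    expressed as a percentage of the triangle height (0% at the base, 100% at the tip).
    """
    names = list(VERTICES)
    a, b, c = (VERTICES[n] for n in names)
    whole = _signed_area(a, b, c)
    p = (x, y)
    parts = (_signed_area(p, b, c), _signed_area(a, p, c), _signed_area(a, b, p))
    return {n: 100.0 * part / whole for n, part in zip(names, parts)}

point = to_xy({"NAM": 0.15, "AFR": 0.10, "EUR": 0.75})
print(read_percentages(*point))
```

For the NAM vertex the reading reduces to the height of the point above the base divided by the height of the triangle, which is the projection onto the dropped line described above.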

[0172] The results provided using the disclosed method provide a statistical estimate of 
BGA admixture for an individual (the Maximum Likelihood Estimate (MLE)), which is 
indicated as a point on a triangle plot to represent the proportions of the most relevant three 
groups for the individual. While the MLE is the most likely estimate, the true value for the 
individual can be a different set of proportions. A triangle plot with calculated and plotted 
estimates that are 2 times, 5 times and 10 times less likely than the MLE is exemplified. The 
first contour around the MLE delimits the space within which the estimates are up to 2 times 
less likely, with those positions near the line reflecting values close to 2 and those near the 
MLE closer to 1; the second contour around the MLE delimits the space within which the 
estimates are up to 5 times less likely in the same graded fashion proceeding from the first contour 
line to the second contour line. The third contour delimits the space within which the 
estimates are from 5 fold (near the second contour line) up to 10 times less likely (near the 
third contour line). The greater the number of DNA positions read, the closer these contour 
lines approach the MLE point. On the triangle plot, the likelihood (probability) that a given 
point represents the true value increases as the point approaches the MLE, where 
the probability is maximum (i.e., the Maximum Likelihood Estimate; MLE). The test can be 
performed so that the contour lines are very close to the MLE by sequencing a very large 
collection of markers. However, to keep the test affordable and efficient, the survey can be 
limited to a desired number of markers (e.g., 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, or 
more) that is sufficient to determine the most likely proportions with good confidence. In this 
respect, a variety of different panels of 100 SNP markers have been examined, a panel of 71 
AIMs has been used in a number of studies, and a panel of 175 AIMs is being examined such 
that very high confidence is achieved. 
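
One way to assign a candidate set of proportions to the contours described above is sketched below in Python. The sketch interprets "N times less likely" as the ratio of the likelihood at the MLE to the likelihood at the candidate point, computed from log-likelihoods; this interpretation, the function name, and the example values are assumptions made for illustration.

```python
import math

def contour_band(ll_point, ll_max, levels=(2, 5, 10)):
    """Return the innermost contour level that encloses the point.

    ll_point: log-likelihood of the candidate proportions.
    ll_max:   log-likelihood at the MLE.
    Returns None when the point lies outside the outermost contour.
    """
    ratio = math.exp(ll_max - ll_point)  # how many times less likely than the MLE
    for level in levels:
        if ratio <= level:
            return level
    return None

ll_max = math.log(1.0)  # hypothetical likelihood at the MLE, scaled to 1
for likelihood in (0.6, 0.25, 0.125, 0.05):
    print(likelihood, contour_band(math.log(likelihood), ll_max))
```

Applying the function to every point of the search grid and drawing the boundary of each band would give the three nested contour lines. If the loci are treated as independent, the log-likelihood is a sum over loci, so reading more informative markers widens the gap between the MLE and distant points and draws the contours inward.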

[0173] The BGA test of the invention has been validated by determining the frequency of 
DNA sequence variants in various human populations. In addition, the test has been 
evaluated using a large number of people from a wide range of ancestral groups, and the 
estimates have corresponded well to what is known from anthropological and historical data. 
For example, Hispanics are known to have arisen as an ethnic group from the blending of 
colonial Europeans with Native Americans, and the hundreds of Hispanics examined using 
the BGA test aligned with these two groups almost exclusively. As another example, though 
Nigerians plot as of almost pure African BGA, African Americans plot more as a mixture 
between this group and Europeans, which is what would be expected from knowledge about 
the admixture between Africans and Europeans in the United States. 

[0174] The method also was validated through pedigree challenge (see Example 1); i.e., 
when the BGA is determined from a mother and father, that of their children should plot 
somewhere between the two. Numerous family pedigrees have been examined using the test, 
and the ancestral proportions of offspring have always plotted between those of the child's 
parents. When the MLE estimates are tested objectively (blindly), they prove to be excellent 
estimates of ancestral proportions. For example, the data for a European American man 
whose mother is European mix and father is mostly Greek, showed the man to be of 85% 
European ancestry, but also of 15% Native American ancestry (Example 1). In fact, his 
paternal great-grandmother was full-blood Cherokee, thus confirming the result of the test 
(based on the laws of genetics, the man would be predicted to have about 12% Native American 
ancestry if his great-grandmother was 100% Native American and none of his other relatives 
were of Native American ancestry). Further, the man's wife is Mexican, and she was 
determined to be of mostly Native American, but with some European and African 
heritage. This also was expected based on what is known of the anthropological origin of 
Hispanics, who derived from the union of Spanish explorers with Native Americans in 
the colonial Caribbean and Latin America. Each of the three children of the man and woman 
plotted roughly halfway between both parents, as expected. None of the children showed 
any Asian or Pacific Islander ancestry, which would have been impossible (assuming an 
accurate test) because neither parent showed any significant Asian or Pacific Islander 
heritage, and none of the children were found to have more African ancestry than their 
mother, which also would be impossible given the fact that the father has virtually none. 
Thus, the results of the children were consistent with those of the parents, and the MLE 
values were accurate estimates when tested against what was known from biographical data. 
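
The pedigree reasoning above can be illustrated with a minimal Python sketch. It assumes that ancestry estimates are held as dictionaries mapping group names to proportions; the function names, the tolerance parameter, and the mother and child values are hypothetical and are not part of the disclosed test. Only the father's 85%/15% estimate comes from the text.

```python
def expected_fraction(generations_back):
    """Expected genome share from one ancestor n generations back (halves each generation)."""
    return 0.5 ** generations_back


def plots_between_parents(child, mother, father, tol=0.05):
    """True if every child proportion lies within the parental range, widened by tol."""
    for group in set(child) | set(mother) | set(father):
        lo = min(mother.get(group, 0.0), father.get(group, 0.0)) - tol
        hi = max(mother.get(group, 0.0), father.get(group, 0.0)) + tol
        if not lo <= child.get(group, 0.0) <= hi:
            return False
    return True


# A great-grandparent is three generations back, i.e. 1/8 of the genome.
print(expected_fraction(3))  # 0.125

father = {"European": 0.85, "Native American": 0.15}                   # from the text
mother = {"Native American": 0.70, "European": 0.20, "African": 0.10}  # hypothetical
child = {"European": 0.55, "Native American": 0.40, "African": 0.05}   # hypothetical
print(plots_between_parents(child, mother, father))  # True
```

The tolerance reflects the fact that individual estimates carry statistical error, so a child's value slightly outside the parental range need not indicate an inconsistent pedigree.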

[0175] The genotypes (nucleotide letters) determined to date are quite accurate. Because 
the latest genetic reading equipment available is used, an accuracy greater than 99% 
is routinely achieved for each site. If an accurate value was not obtained for a particular site 
in a particular sample, an "FL" is indicated instead of the genotype letters for that site. 
Having a few FLs generally does not prevent a good ancestry estimate. A sample can 
produce an FL for a site because, for example, a small region of the chromosome around this 
site is missing or is of different sequence character than for most individuals (this result is not uncommon 
given the highly variable nature of the chromosomal positions measured); or because not 
enough DNA was obtained from the buccal swab used to collect a DNA sample. 

[0176] The genome was scanned for a useful panel of BGA AIMs, and the 71 AIMs that 
performed best when the maximum likelihood algorithm was used to measure BGA admixture 
proportions were selected (Table 6). Using these AIMs, majority BGA affiliations were 
measurable in a manner consistent with self-held notions on BGA, and BGA admixture 
proportions were measurable with significantly improved precision, accuracy and reliability 
compared to previously described methods for the inference of race (see Example 2; see, also, 
Example 1, using a 32-marker test). This test can be used during study design to help reduce or 
eliminate the insidious effects that cryptic or micro population structure imposes. The test 
also can be useful for forensic scientists who currently use imprecise and sometimes 
inaccurate means by which to infer race from crime scene DNA. 
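
A maximum likelihood estimate of admixture proportions can be sketched as follows. This is an illustrative formulation only, not necessarily the algorithm used in the disclosed test: it assumes three parental populations, biallelic markers, Hardy-Weinberg genotype probabilities, and a simple grid search over proportions that sum to one. The function names and the toy allele frequencies in the usage note are hypothetical.

```python
import math
from itertools import product


def log_likelihood(genotypes, freqs, m):
    """Log-likelihood of allele counts (0, 1 or 2 per marker) given proportions m."""
    total = 0.0
    for g, p in zip(genotypes, freqs):
        q = sum(mk * pk for mk, pk in zip(m, p))  # expected allele frequency
        q = min(max(q, 1e-6), 1 - 1e-6)           # guard against log(0)
        total += math.log(math.comb(2, g)) + g * math.log(q) + (2 - g) * math.log(1 - q)
    return total


def mle_admixture(genotypes, freqs, steps=20):
    """Grid search over three-way proportions; returns the most likely triple."""
    best, best_ll = None, -math.inf
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        m = (i / steps, j / steps, (steps - i - j) / steps)
        ll = log_likelihood(genotypes, freqs, m)
        if ll > best_ll:
            best, best_ll = m, ll
    return best


# Hypothetical parental allele frequencies: one near-diagnostic marker per population.
toy_freqs = [(0.99, 0.01, 0.01), (0.01, 0.99, 0.01), (0.01, 0.01, 0.99)]
print(mle_admixture([2, 0, 0], toy_freqs))  # (1.0, 0.0, 0.0)
```

With only three markers the estimate is coarse; in practice the likelihood is summed over the whole AIM panel, which is why larger panels narrow the confidence region.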

[0177] The present invention also provides articles of manufacture, including one or a 
plurality of photographs, each photograph being of a person having a known proportional 
ancestry corresponding to a population structure comprising nucleotide occurrences of AIMs 
indicative of BGA, the known proportional ancestry being associated with the photograph in 
the article. An article of manufacture of the invention (i.e., a photograph and the proportional 
ancestry information) can be contained in one or more files (e.g., the photograph and 
information in one file, or the photograph in one file and the information in a second file, 
which is or can be linked to the photograph). If desired, more than one photograph of an 
individual having a known proportional ancestry can be contained in the same or a linked file, 
for example, photographs containing different profiles of the individual or photographs of the 
individual at various ages. 

[0178] Similarly, a plurality of the articles (i.e., photographs and proportional ancestry 
information) can be contained in a file, for example, a file containing a plurality of 
photographs of different persons, wherein some or all of the persons have the same or 
different known proportional ancestries that correspond to a population structure comprising 
nucleotide occurrences of AIMs indicative of BGA. Such a plurality of articles also can be 
contained in different files, including, for example, a plurality of files, each containing one 
photograph and information regarding the known proportional ancestry of the individual in 
the photograph, or each containing two or more photographs of different individuals, each of 
whom has the same known proportional ancestry, or each containing two or more 
photographs of different individuals, some or all of whom have a different proportional 
ancestry as compared to another individual whose photograph is contained in the file. 
Accordingly, a plurality of such articles is provided, as is a plurality of files, each file of 
which can contain one or more articles, i.e., photographs, which can be of one or more 
persons having the same or different known proportional ancestries that correspond to a 
population structure comprising nucleotide occurrences of AIMs indicative of BGA; and the 
plurality of files can contain files, each of which contains one or more photographs of one or 
more persons, and when containing one or more photographs of two or more different 
persons, the different persons can have the same or different known proportional ancestries. 

[0179] The article of manufacture, i.e., the photograph of a person having a known 
proportional ancestry corresponding to a population structure comprising nucleotide 
occurrences of AIMs indicative of BGA, can be a digital photograph, which comprises digital 
information, including for the photographic image and any other information that may be 
relevant or desired (e.g., the age, name, or contact information of the subject in the 
photograph, or the subject's answer on a questionnaire as to what the subject believes his or 
her ancestry to be). Such digital information of one or more digital photographs can be 
contained in a database, thus facilitating searching of the photographs and/or known 
proportional ancestry information using electronic means. As such, the present invention 
further provides a plurality of the articles of manufacture, including at least two digital 
photographs, each of which comprises digital information. Where the digital information for 
one or a plurality of the articles is contained in a database, the database can be stored on any medium 
suitable for containing such a database, including, for example, computer hardware or 
software, a magnetic tape, or a computer disc such as a floppy disc, CD, or DVD. As such, the 
database can be accessed through a computer, which can contain the database therein, can 
accept a medium containing the database, or can access the database through a wired or 
wireless network, e.g., an intranet or the internet. 
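
One way such a searchable database could be organized is sketched below, using an in-memory SQLite database. The table name, column names, file paths, and ancestry values are all hypothetical and chosen only for illustration; the disclosure does not prescribe a schema.

```python
import sqlite3

# Hypothetical schema: one row per photograph with its known proportional ancestry.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE photo_record (
           id INTEGER PRIMARY KEY,
           photo_path TEXT NOT NULL,
           european REAL, african REAL, native_american REAL, east_asian REAL
       )"""
)
records = [  # illustrative values only
    ("photos/subject_001.jpg", 0.85, 0.00, 0.15, 0.00),
    ("photos/subject_002.jpg", 0.17, 0.83, 0.00, 0.00),
    ("photos/subject_003.jpg", 0.59, 0.06, 0.35, 0.00),
]
conn.executemany(
    "INSERT INTO photo_record (photo_path, european, african, native_american, east_asian) "
    "VALUES (?, ?, ?, ?, ?)",
    records,
)

ANCESTRY_COLUMNS = {"european", "african", "native_american", "east_asian"}


def find_photos(conn, group, low, high):
    """Return photo paths whose proportion for `group` lies in [low, high]."""
    if group not in ANCESTRY_COLUMNS:  # whitelist the column name before interpolating it
        raise ValueError(f"unknown ancestry group: {group}")
    query = f"SELECT photo_path FROM photo_record WHERE {group} BETWEEN ? AND ? ORDER BY id"
    return [row[0] for row in conn.execute(query, (low, high))]


print(find_photos(conn, "native_american", 0.10, 0.20))  # ['photos/subject_001.jpg']
```

Storing the proportions as numeric columns, rather than inside the image file, is what makes range queries of this kind straightforward.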

[0180] The present invention also provides kits useful for practicing a method of the 
invention. Such kits can contain, for example, a plurality of hybridizing oligonucleotides, 
each of which has a length of at least fifteen contiguous nucleotides of a polynucleotide as set 
forth in SEQ ID NOS:1 to 331 (or a polynucleotide complementary thereto), the plurality 
including at least five (e.g., 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, etc.) of such oligonucleotides, 
each based on different polynucleotides as set forth in SEQ ID NOS:1 to 331. In one 
embodiment, the hybridizing oligonucleotides include at least fifteen contiguous 
nucleotides of at least five polynucleotides as set forth in SEQ ID NOS:1 to 71, or 
polynucleotides complementary to any of SEQ ID NOS:1 to 71. In another embodiment, the 
hybridizing oligonucleotides are specific for at least ten AIMs as set forth in SEQ ID NOS:1 
to 71. A kit of the invention also can contain at least two panels of such hybridizing 
oligonucleotides, including, for example, a panel of at least five (e.g., 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, etc.) hybridizing oligonucleotides specific for AIMs as set forth in SEQ ID 
NOS:7, 21, 23, 27, 45, 54, 59, 63, and 72 to 152; or a panel of at least five hybridizing 
oligonucleotides specific for AIMs as set forth in SEQ ID NOS:3, 8, 9, 11, 12, 33, 40, 59, 
63, and 153 to 239; or a panel of at least five hybridizing oligonucleotides specific for AIMs 
as set forth in SEQ ID NOS:1, 8, 11, 21, 24, 40, 172, and 240 to 331; or two or more of such 
panels and/or a panel of at least five hybridizing oligonucleotides specific for AIMs as set 
forth in SEQ ID NOS:1 to 71. 

[0181] The hybridizing polynucleotides of a kit of the invention can include probes, which 
are useful for detecting a particular AIM, including a particular nucleotide occurrence at the 
SNP position of the AIM; can include primers, including primers useful for a primer 
extension reaction and primer pairs useful for a nucleic acid amplification reaction; or can 
include combinations of such probes and primers. A hybridizing oligonucleotide of the 
plurality can, but need not, include a nucleotide corresponding to the nucleotide position of the 
SNP or DIP of an AIM, e.g., nucleotide 50 of an AIM as set forth in any of SEQ ID NOS:1 
to 55 and 57 to 331 or nucleotide 26 of SEQ ID NO:56, or to a nucleotide sequence 
complementary thereto, such a hybridizing oligonucleotide being useful as a probe to identify 
the presence or absence of a particular nucleotide occurrence at the SNP position of the AIM. 

[0182] A kit of the invention also can contain at least one pair of hybridizing 
oligonucleotides useful for detecting the nucleotide occurrence at the SNP position or the 
presence or absence of a nucleotide sequence at the DIP position of an AIM. For example, a 
pair of hybridizing oligonucleotides can include one oligonucleotide that hybridizes upstream 
and adjacent to the SNP position of an AIM and a second oligonucleotide that hybridizes 
downstream of and adjacent to the SNP position of the AIM, wherein one or the other of the 
pair further contains a nucleotide complementary to a nucleotide occurrence suspected of 
being at the SNP position of the AIM (i.e., one of the polymorphic nucleotides), such a pair 
of hybridizing oligonucleotides being useful in an oligonucleotide ligation assay. In another 
example, a pair of hybridizing oligonucleotides can include an amplification primer pair, 
including a forward primer and a reverse primer, such a pair of hybridizing oligonucleotides 
being useful for amplifying a portion of polynucleotide that includes the SNP or DIP position 
of the AIM. 

[0183] A kit of the invention can further contain additional reagents useful for practicing a 
method of the invention. As such, the kit can contain one or more polynucleotides 
comprising an AIM, including, for example, a polynucleotide containing an AIM that a 
hybridizing oligonucleotide or pair of hybridizing oligonucleotides of the kit is designed to 
detect, such polynucleotide(s) being useful as controls. Further, hybridizing oligonucleotides 
of the kit can be detectably labeled, or the kit can contain reagents useful for detectably 
labeling one or more of the hybridizing oligonucleotides of the kit, including different 
detectable labels that can be used to differentially label the hybridizing oligonucleotides; such 
a kit can further include reagents for linking the label to hybridizing oligonucleotides, or for 
detecting the labeled oligonucleotide, or the like. A kit of the invention also can contain, for 
example, a polymerase, particularly where hybridizing oligonucleotides of the kit include 
primers or amplification primer pairs; or a ligase, where the kit contains hybridizing 
oligonucleotides useful for an oligonucleotide ligation assay. In addition, the kit can contain 
appropriate buffers, deoxyribonucleotide triphosphates, etc., depending, for example, on the 
particular hybridizing oligonucleotides contained in the kit and the purpose for which the kit 
is being provided. 

[0184] The following examples are intended to illustrate but not limit the invention. 

EXAMPLE 1 

DETERMINATION OF BIOGEOGRAPHICAL ANCESTRY 
USING ANCESTRY INFORMATIVE MARKERS 

[0185] This Example demonstrates that a panel of 32 Ancestry Informative Markers 
(AIMs) allows an estimate of the genetic contribution from populations of African, European 
and Native American ancestry. 

[0186] The AIMs used in the exemplified study include single nucleotide polymorphisms 
(SNPs), deletion/insertion polymorphisms (DIPs) and Alu sequences (see Example 2 for 
identification of AIMs). Markers showing differences between the parental populations 
greater than 30% were selected (Table 1; see, also, SEQ ID NOS:332-363). Informative 
genetic markers were identified by testing each candidate marker in a panel of European 
(Spanish and German), African (from Nigeria, Sierra Leone, and Central African Republic), 
and Native American populations (Mayan and Southwestern Native Americans) to confirm 
the usefulness of the marker for admixture estimation. 

TABLE 1 

Ancestry Informative Marker Panel 

MARKER                LOCATION  Mb      AF/EU   AF/NA   EU/NA
MID 575 (356*)        1p34.3    ~42     0.130   0.417   0.546
MID 187 (357)         1p32      ~50.2   0.370   0.440   0.070
FY-NULL (339)         1q23.2    ~181    0.999   0.999   0.000
AT3 (359)             1q25.1    ~196    0.575   0.777   0.202
F13B (338)            1q31.3    ~220    0.641   0.674   0.033
TSC1102055 (343)      1q32.1    ~234.5  0.441   0.303   0.744
WI-11392 (NS**)       1q42.2    ~269.5  0.444   0.256   0.188
WI-16857 (345)        2p16.1    ~56.2   0.536   0.548   0.012
WI-11153 (346)        3p12.1    ~95.0   0.652   0.022   0.629
GC*1F (NS)            4q13.3    ~75.7   0.697   0.530   0.166
GC*1S (NS)            4q13.3    ~75.7   0.538   0.478   0.060
MID-52 (NS)           4q24      ~110.7  0.186   0.500   0.687
SGC30610 (354)        5q11.2    ~61.5   0.146   0.281   0.427
SGC30055 (355)        5q22.1    ~124.7  0.457   0.675   0.218
WI-17163 (347)        5q33.1    ~173.9  0.120   0.641   0.521
WI-9231 (348)         7p22.3    ~1.2    0.017   0.387   0.370
WI-4019 (349)         7q21.3    ~100    0.124   0.173   0.296
CYP3A4 (NS)           7q22.1    ~101.9  0.761   0.755   0.006
LPL (340)             8p21.3    ~22.3   0.479   0.521   0.042
CRH (NS)              8q13.2    ~73.2   0.609   0.655   0.046
WI-11909 (350)        9q21.31   ~81.0   0.075   0.587   0.663
D11S429 (337)         11q13.3   ~70.4   0.429   0.054   0.376
TYR (344)             11q14.3   ~95.4   0.444   0.055   0.389
DRD2-Taq I "D" (336)  11q23.2   ~125.0  0.535   0.046   0.582
DRD2-Bcl I (335)      11q23.2   ~125.0  0.080   0.565   0.485
APOA1 (360)           11q23.3   ~128.9  0.505   0.555   0.050
GNB3 (332)            12p13.31  ~7.2    0.463   0.430   0.033
RB1 (361)             13q14.2   ~47.4   0.611   0.711   0.100
OCA2 (342)            15q12     ~24.0   0.631   0.369   0.263
WI-14319 (351)        15q14     ~30.0   0.185   0.310   0.494
CYP19 (334)           15q21.2   ~47.6   0.045   0.379   0.423
PV92 (362)            16q23.3   ~96.5   0.073   0.551   0.624
MC1R314 (341)         16q24.3   ~103.8  0.350   0.441   0.090
WI-14867 (352)        17p13.2   ~3.5    0.448   0.404   0.045
WI-7423 (353)         17p13.1   ~8.2    0.476   0.074   0.402
Sb19.3 (363)          19p13.11  ~27.0   0.488   0.236   0.253
CKM (333)             19q13.2   ~55.8   0.150   0.694   0.545
MID 154 (358)         20q11.23  ~34     0.444   0.368   0.076
MID 93 (NS)           22q13.2   ~38.6   0.554   0.179   0.733

Shown are the marker name and chromosomal band, approximate location of the marker on 
the chromosome in megabases (Mb), and difference in frequency between African and 
European populations (AF/EU), African and Native Americans (AF/NA) and European and 
Native Americans (EU/NA). Differences greater than 30% are marked in bold letters (see, 
also, Shriver et al., supra, 2003, which is incorporated herein by reference). 

* Numbers in parentheses are SEQ ID NOS for AIMs; ** NS, sequence not shown. 

[0187] The publicly available human genome sequence database and polymorphism 
database were screened in order to identify SNPs that met the criteria for being a good AIM. 
Allele frequencies are available for many of the SNPs in the public databases for three 
populations - Africans, Europeans and Asians. Since these frequencies are obtained from 
small samples, they are not always accurate. The main criterion for selection herein was the 
delta value derived from these frequencies, which is a statistical measure of the 
difference in minor allele frequency between various populations of human beings. For 
example, a C or a G polymorphism at a particular place in the human genome, where the C is 
present mainly in individuals of European descent and the G present mainly in individuals of 
Native American descent, would have a high delta value and, therefore, qualify as a good 
AIM. Similarly, an A or a C polymorphism at a particular place in the human genome, where 
the A is present mainly in individuals of African descent and the C present mainly in 
individuals of Asian descent would have a large frequency differential between these groups 
and, therefore, a high delta value, thus qualifying as a good AIM. A list of such "candidate" 
AIMs was compiled, ranked from largest delta value to smallest delta value for each of the 
possible pair-wise population comparisons, and screened, one at a time, against a panel of 
"parental" samples. Parental samples are samples from regions of the world that are 
relatively homogeneous, for example, Niger or Congo for sub-Saharan Africans, Southern 
Mexico for Native Americans, China for East Asians, and Europe for Europeans. 
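
The ranking step described above can be sketched in Python. The sketch assumes that delta is the absolute difference in allele frequency between two populations and reuses the 30% cutoff mentioned for Table 1; the marker names and frequencies are hypothetical, as are the function names.

```python
from itertools import combinations


def delta(p1, p2):
    """Absolute allele-frequency difference between two populations."""
    return abs(p1 - p2)


def rank_candidates(freqs, threshold=0.30):
    """freqs: {marker: {population: frequency}}.

    Returns {(pop_a, pop_b): [(marker, delta), ...]}, keeping markers whose delta
    exceeds the threshold, sorted from largest to smallest delta.
    """
    pops = sorted(next(iter(freqs.values())))
    ranked = {}
    for a, b in combinations(pops, 2):
        hits = [(name, round(delta(f[a], f[b]), 3)) for name, f in freqs.items()]
        hits = [h for h in hits if h[1] > threshold]
        ranked[(a, b)] = sorted(hits, key=lambda h: h[1], reverse=True)
    return ranked


# Hypothetical public-database frequencies for three candidate SNPs.
candidates = {
    "snpA": {"AF": 0.95, "EU": 0.05, "NA": 0.10},
    "snpB": {"AF": 0.40, "EU": 0.45, "NA": 0.90},
    "snpC": {"AF": 0.50, "EU": 0.50, "NA": 0.50},
}
for pair, hits in rank_candidates(candidates).items():
    print(pair, hits)
```

Because database frequencies come from small samples, a ranking of this kind only nominates candidates; each still has to be confirmed against the parental samples.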

[0188] About half of the candidate AIMs proved to be not very useful because their actual 
delta values were not as high as expected from the public database allele frequencies (some 
were not even SNPs, or could not be assayed using the present platform). Sequences that 
were validated as true AIMs, such as those exemplified herein, were useful for admixture 
mapping, making inferences of individual ancestry proportions, and making inferences of 
population group admixture proportions, as well as for screening genomes in order to identify 
markers with alleles that correlated with certain human traits through their ancestry 
informativeness. Even though each candidate AIM was initially selected from the public 
database based on crude population structure differences (i.e., continental populations), many 
of them were found to carry information on finer levels of structure because the separation of 
subgroups of humans from larger groups throughout human evolution has provided a fertile 
opportunity for genetic drift, founder effects, and natural selection to operate in either fixing 
or eliminating their sequences. 

[0189] Sequences are shown in the Sequence Listing from 5' to 3' (left to right), and, for 
SEQ ID NOS:1 to 331, with the SNP generally, but not always, at nucleotide position 50 
from the 5' terminus (except for SEQ ID NO:56, position 26). The polymorphism is 
indicated with an IUB symbol, wherein S=C/G, Y=C/T, R=A/G, K=G/T, W=A/T, etc. As 
such, the disclosed sequences (SEQ ID NOS:1 to 331) provide information as to the target 
being examined (i.e., the polymorphism) as well as information for preparing primers and 
amplification primer pairs, and hybridization probes, for sampling the SNP (i.e., determining 
the genotype of a sample). Further, the disclosed sequences can be used, if desired, to scan 
public databases to identify additional upstream and downstream nucleotide sequences. 
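
The sequence convention described above lends itself to a short sketch that locates the polymorphic position and writes out one allele-specific sequence per allele. The toy sequence is invented for illustration and is not taken from the Sequence Listing; the function names and the flank length are likewise assumptions, and a real probe would span at least fifteen nucleotides.

```python
# Two-base IUB/IUPAC ambiguity codes (the text lists S, Y, R, K and W; M = A/C completes the set).
IUB = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def find_snp(seq):
    """Return (1-based position, alleles) of the first two-allele IUB symbol in seq."""
    for i, base in enumerate(seq.upper()):
        if base in IUB:
            return i + 1, IUB[base]
    raise ValueError("no two-allele IUB symbol found")


def allele_probes(seq, flank=7):
    """Return {allele: sequence} with `flank` bases on each side of the SNP."""
    seq = seq.upper()
    pos, alleles = find_snp(seq)
    left = seq[max(0, pos - 1 - flank):pos - 1]
    right = seq[pos:pos + flank]
    return {allele: left + allele + right for allele in alleles}


toy = "ACGTACGTACSTTGCATGCA"  # hypothetical sequence, S = C/G at position 11
print(find_snp(toy))                 # (11, 'CG')
print(allele_probes(toy, flank=4))   # {'C': 'GTACCTTGC', 'G': 'GTACGTTGC'}
```

The same position lookup can be used to choose primer sites upstream and downstream of the SNP rather than overlapping it.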

[0190] This panel of markers was extremely powerful for estimating with precision 
admixture proportions in population samples (standard error typically between 1% and 5%). 
In addition, the AIMs provided reasonable estimates of individual ancestry, and suggested 
that an equivalent precision can be obtained using more markers (confirmed in Example 2). 
Two independent methods for estimating individual ancestry, a Maximum Likelihood Estimate 
(MLE) method (Chakraborty et al., supra, 1986) and a Bayesian method using the program 
STRUCTURE (Pritchard et al., supra, 2000), were used; the values obtained by both methods 
were highly correlated (R² = 0.9836) when estimates of individual ancestry were compared in 
terms of percent African genetic contribution in a sample of African Americans from 
Washington DC. These markers are excellent for determining whether there is population 
structure in a sample from an admixed population. This ability is important in terms of 
admixture mapping applications because, as discussed below, the process of admixture can 
produce significant structure in a population, and consequently a high number of false 
positive results (positive associations caused by genetic structure, not by physical linkage of a 
marker with a disease causative gene), increasing significantly the risk of misinterpreting 
mapping results. 

[0191] The present study confirms that AIMs can be identified using the disclosed 
methods, and provides a panel of 32 AIMs that can be applied towards an ultimate goal of 
compiling a panel of approximately 1,000 AIMs spanning the entire human genome. 
Candidate AIMs were obtained by screening SNP allele frequency data generated through 
The SNP Consortium (TSC). Six sites, including the Sanger Centre, Celera Genomics, 
Washington University, Orchid Biosciences, Motorola, and Whitehead Institute, have 
generated, as of 2003, allele frequencies on 60,000 SNPs located throughout the genome 
using a central collection of 42 individuals from each of 3 populations (African-American, 
European-American, and Asian-American). This database, which is freely available to 
researchers (see, e.g., using hypertext transfer protocol ("http"), at URL "snp.cshl.org"), has 
been used to provide the present results, thus demonstrating the usefulness of the resource to 
compile a genome-wide panel of AIMs. 

[0192] The present study focused on the accuracy of the SNP database and the number of 
candidate SNPs present therein. With respect to the accuracy of the database, each group 
involved in the SNP consortium has taken a different approach to generating data. As such, 
initial concerns regarding how the data can be combined were addressed. Because the 
genotyping approaches were different for each group, it was necessary to address the question 
of ascertainment biases that might differentially affect the data of particular groups. For 
example, most of the groups produced their allele frequencies after sequencing a subset of the 
TSC diversity panels, then scoring these markers in the larger groups of 42 individuals from 
the 3 populations. The Washington University group has taken an approach whereby pooled 
sequencing throughout regions was performed, and the allele frequencies calculated for 
variable positions discovered during this effort. The Orchid group has not used sequencing 
but, instead, started with loci from the TSC SNP database that are known to be polymorphic. 
Given such differences, a systematic characterization was made as to the extent, if any, to which 
different biases may have affected the results. 

[0193] One approach for systematically characterizing such potential bias was to compare 
the allele frequencies for loci that were genotyped by more than one group. Although there 
was dispersion around the 45° line, as expected, there was general agreement in the frequency 
data obtained by the different groups (R² = 0.8762), indicating that the extent of the allele 
frequency bias introduced by the differing genotyping and ascertainment strategies was 
limited. The next step in testing the accuracy of these data was to separate the data by site 
and perform pair-wise comparisons, which would allow the identification of particular sites 
that have allele frequency estimates that deviate more when compared to other sites. 
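
A pair-wise comparison of this kind can be sketched as follows, assuming that each site's data are available as a mapping from SNP identifier to allele frequency. The site names, SNP identifiers, and frequencies are hypothetical, and the squared Pearson correlation is used here as one reasonable agreement measure.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def pairwise_site_agreement(site_freqs, min_shared=3):
    """site_freqs: {site: {snp: frequency}}; returns {(site_a, site_b): R^2 over shared SNPs}."""
    agreement = {}
    for a, b in combinations(sorted(site_freqs), 2):
        shared = sorted(set(site_freqs[a]) & set(site_freqs[b]))
        if len(shared) >= min_shared:
            agreement[(a, b)] = r_squared(
                [site_freqs[a][s] for s in shared],
                [site_freqs[b][s] for s in shared],
            )
    return agreement


# Hypothetical frequencies for SNPs typed at more than one site.
sites = {
    "site_A": {"snp1": 0.10, "snp2": 0.50, "snp3": 0.90, "snp4": 0.30},
    "site_B": {"snp1": 0.12, "snp2": 0.47, "snp3": 0.88, "snp4": 0.35},
    "site_C": {"snp1": 0.40, "snp2": 0.45, "snp3": 0.50},
}
for pair, r2 in pairwise_site_agreement(sites).items():
    print(pair, round(r2, 3))
```

A site whose R² values are consistently lower than those of the other pairs would be the one to examine for ascertainment or genotyping bias.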

[0194] With respect to the number of candidate AIMs, it also was important to determine 
how many of the 60,000 SNPs characterized by the TSC would be useful for admixture 
mapping. Since it can be useful ultimately to compile a panel of about 1,000 markers 
showing large frequency differences between the relevant population groups (Fst>0.4), it was 
important to evaluate what percentage of the available markers have the desired 
characteristics. Candidate AIMs were based on the recommendation of McKeigue et al. 
(supra, 2000). For the markers with information available for African, Asian and European 
populations, the cumulative proportion of markers in each Fst category (0-1, in 0.05 intervals) 
and the total number of candidate AIMs for each possible comparison were determined. The 
distribution of pairwise Fst from the TSC allele frequency project was as follows: Asian-European (556 
candidate AIMs/25,110 total SNPs; average Fst = 0.0720); Asian-African (1026 candidate 
AIMs/25,578 total SNPs; average Fst = 0.0886); and European-African (1306 candidate 
AIMs/30,103 total SNPs; average Fst = 0.0861). As such, the screen revealed that about 
2-5% of the markers can be useful for admixture mapping. 
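
The figures above can be checked with a few lines of Python. The Fst function is a textbook two-population formulation for a biallelic marker with equal weighting, given here as an assumption because the text does not state which estimator was used; the candidate and total counts are taken from the paragraph above.

```python
def fst_two_pops(p1, p2):
    """Fst for one biallelic marker: (Ht - Hs) / Ht, with equal population weights."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                           # total expected heterozygosity
    if h_t == 0:
        return 0.0                                          # marker is monomorphic
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2       # mean within-population value
    return (h_t - h_s) / h_t


# Candidate AIMs and total SNPs per comparison, as reported in the text.
counts = {
    "Asian-European": (556, 25110),
    "Asian-African": (1026, 25578),
    "European-African": (1306, 30103),
}
for pair, (candidates_found, total) in counts.items():
    print(f"{pair}: {100 * candidates_found / total:.1f}% candidate AIMs")
# Asian-European: 2.2% candidate AIMs
# Asian-African: 4.0% candidate AIMs
# European-African: 4.3% candidate AIMs

# Allele frequencies of 0.1 and 0.9 give Fst of about 0.64, above the 0.4 cutoff.
print(fst_two_pops(0.1, 0.9) > 0.4)  # True
```

The three percentages fall within the range of about 2-5% stated in the text.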

[0195] The geographical pattern of admixture in the US admixed populations, particularly 
African Americans and Hispanics, was the subject of an initial examination. The admixture 
proportions of more than 18 African-American populations were characterized, and a map 
was generated showing an estimate of the European genetic contribution to African 
Americans from several different geographic areas in the United States. The European 
admixture ranged from 3.5% in the Gullah of South Carolina to 22.5% in New Orleans 
(e.g., 18.8% in Chicago; and 16.4% in Houston). Most of these estimates were obtained 
using an initial panel of 10 informative AIMs. The observed distribution was interpreted in 
terms of well known historical and demographic events that have played an important role in 
African American history (see Parra et al., supra, 1998, Parra et al., supra, 2001). These data 
allow the application of admixture mapping to identify genes involved in complex diseases. 
It is expected that admixture mapping will be more suitable in populations showing a high 
degree of admixture and, therefore, populations such as the Gullah (3.5%) and Jamaicans 
(6.6%), in which European genetic contribution has been very limited, may not be suitable 
for this kind of analysis. 

[0196] In preliminary studies using mitochondrial DNA (mtDNA), it was observed that 
African Americans have a detectable, although low, Native American genetic contribution, in 
accordance with the self-reported Native American ancestry often mentioned by 
African-American individuals. Having identified 30 AIMs informative for African/Native American 
and 19 AIMs for European/Native American contrasts (see Table 1; see, also, Shriver et al., 
supra, 2003), the presence of Native American admixture in three African-American 
populations was examined using nuclear DNA markers. In accordance with the mtDNA 
estimates (which only provide "maternal" contribution information), evidence of a low Native 
American genetic contribution was detected in each of the African-American samples 
(Washington DC, 6%; Afro-Caribbeans from London, 5%; and Bogalusa, Louisiana, 6%). 

[0197] Regarding admixture in Hispanics, the relative European, Native American and 
African contribution in a sample of Spanish-Americans from the San Luis Valley, CO, was 
estimated. A 59% European admixture, 35% Native American admixture, and 6% African 
admixture were observed in this sample, in good agreement with estimates previously 
described for populations of Mexican ancestry (Chakraborty et al., supra, 1986; Hanis et al., 
supra, 1991; Tseng et al., Amer. J. Phys. Anthropol. 106:361-71, 1998; Collins-Schram et al., 
supra, 2002). As shown in Example 2, further characterization of admixture in additional 
samples from Mexico and in two samples from Hispanics of Puerto Rican ancestry (New 
York and Puerto Rico) has been performed. 

[0198] A sample of individuals of European ancestry (N=199) currently living in State 
College, PA, also was analyzed. The genetic contribution in this sample was predominantly 
of European origin (91%), with evidence of some African (3%) and Native American (6%) 
influence. These results are summarized in Figure 4, using a triangular plot, which clearly 
reveals the differences in average admixture levels between European Americans, Spanish 
Americans, and African Americans. The triangular plot shown in Figure 4 represents the 
average admixture estimate in a particular sample; the underlying distributions of individual 
ancestry are complex, with different individuals showing widely dispersed values of African, 
European and Native American ancestry (not shown). In African Americans, most 
individuals showed predominantly African genetic contribution, but some persons showed 
relatively high European contribution and also, to a lesser extent, Native American ancestry. 
European Americans clustered more tightly near the pole corresponding to high European 
contribution, with few persons showing evidence of Native American and African ancestry. 
Spanish Americans showed the highest dispersion of individual ancestry, as expected given 
the high admixture level observed in this sample. 

[0199] Notably, individuals showed the whole range of European and Native American 
ancestry (from 100% European to 100% Native American), and a relatively lower African 
genetic contribution also was evident in some individuals. Some of the variance observed in 
individual ancestry was likely due to the stochastic error resulting from the limited number of 
markers used to infer ancestry. Thus, while the 20-32 markers used in the exemplified test 
detected individual ancestry, the standard error of the estimates was fairly high; increasing 
the number of AIMs is expected to increase the precision of the individual ancestry estimates 
(see Example 2). The other component of the variation in individual ancestry was due to true 
differences in ancestry between individuals. The remarkable correlation in individual 
ancestry values obtained by two totally independent methods, MLE and STRUCTURE 
(discussed above), indicates that this panel of markers can capture the underlying individual 
ancestry patterns characteristic of these populations. As disclosed below, controlling for 
variations in individual ancestry allows the avoidance of false positive results. 
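
The triangular plot used to display three-way admixture (Figure 4, discussed above) places each sample or individual inside a triangle according to its three ancestry proportions. A minimal sketch of the coordinate conversion follows; the placement of the vertices (African at the lower left, European at the lower right, Native American at the top) is an assumption, since the figure itself is not reproduced here, and the function name is hypothetical. The two sample averages are those reported in the text.

```python
import math


def ternary_xy(african, european, native_american):
    """Map three ancestry proportions to (x, y) inside an equilateral triangle.

    Assumed vertex layout: African (0, 0), European (1, 0), Native American (0.5, sqrt(3)/2).
    """
    total = african + european + native_american
    e = european / total          # normalize so proportions sum to 1
    n = native_american / total
    return (e + 0.5 * n, (math.sqrt(3) / 2) * n)


# Average admixture estimates reported above (African, European, Native American).
samples = {
    "European Americans (State College)": (0.03, 0.91, 0.06),
    "Spanish Americans (San Luis Valley)": (0.06, 0.59, 0.35),
}
for name, (af, eu, na) in samples.items():
    x, y = ternary_xy(af, eu, na)
    print(f"{name}: x={x:.3f}, y={y:.3f}")
```

Plotting individuals rather than sample averages with the same conversion shows the dispersion of individual ancestry that the averages conceal.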

[0200] The effect of admixture dynamics on population structure and the extent of linkage 
disequilibrium (LD) was examined. The importance of the admixture model (hybrid isolation 
model vs. continuous gene flow model) in terms of the population structure and LD created in 
the admixture process was previously described (Pfaff et al., supra, 2001), and two methods 
to quantify the level of population structure in admixed populations were presented. 
Population structure is a key aspect of admixture mapping, as well as of any genetic 
association study in an admixed population. This issue has been explored in an 
African-American, a Spanish-American and a European-American sample using more 
informative markers than previously available. 

[0201] The presence of structure was evaluated in two different ways. First, the observed 
number of significant associations between unlinked markers was compared with the number 
expected at the 5% significance level, and second, the average correlation of individual 
ancestry was estimated using two subsets of genetic markers. In agreement with previously 
reported data, the African-American population from Washington DC showed a significant 
genetic structure, as reflected by a much higher number of significant associations between 
unlinked markers than expected by chance (10.5% vs. 5%, Figure 5A). Very strong 
associations were observed between markers located as far apart as 24 Mb (AT3-F13B, 
G=15.21, p<0.0001), providing clear evidence that these significant associations are caused 
by the admixture process. The alleles that were associated were always combinations of the 
alleles that are frequent in African populations, and the higher the difference in frequency, 
the more frequently associations were observed between markers: FY, showing the highest 
difference in frequency between African and European populations, was significantly associated with 9 
unlinked markers. Thus, the admixture process in this African-American population, 
showing 17% European ancestry, created a strong association between markers showing high 
frequency differences between African and European populations. Although these 
associations are significant both when the markers are linked and when they are unlinked, 
linked markers tended to show higher G values than unlinked markers, indicating that the 
association due to true linkage can be distinguished from the association due to genetic 
structure, as previously demonstrated (McKeigue et al., supra , 2000). Interestingly, 
significant associations were observed between markers showing high frequency differences 
between European and African populations, but not between markers showing high frequency 
differences between African and Native American, or European and Native American 
populations. This result was likely due to the low Native American ancestry observed in this 
sample (6%), which was insufficient to create detectable associations due to the admixture 
process in such a small sample. 
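
By way of illustration only, the G statistic referred to above can be sketched for a pair of biallelic markers as a likelihood-ratio test of independence on a 2x2 table of allele-combination counts. The sketch assumes that phase is known (or has been estimated separately) so that such a table is available; the function name g_test_2x2 and the example counts are hypothetical and are not taken from the study data.

```python
import math


def g_test_2x2(table):
    """Return (G, p) for a 2x2 table of counts, testing independence (1 df).

    table[i][j] is the count of chromosomes carrying allele i at the first
    marker and allele j at the second marker (hypothetical layout).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to G
                expected = row_totals[i] * col_totals[j] / total
                g += 2.0 * observed * math.log(observed / expected)

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value


if __name__ == "__main__":
    g, p = g_test_2x2([[30, 10], [10, 30]])
    print(f"G = {g:.2f}, p = {p:.2g}")
```

Applying such a test to every pair of unlinked markers and counting the fraction of pairs with p < 0.05 gives the kind of observed proportion that is compared above with the 5% expected by chance.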

[0202] Another line of evidence demonstrating the high level of genetic structure present 
in this African-American sample was the significant correlation between independent 
estimates of individual ancestry using different subsets of the genetic markers. The average 
correlation over 100 random selections of independent subsets of markers was r=0.40, 
p<0.0001 (Figure 5B). The pattern of genetic structure and LD observed in Washington DC 
African Americans and in other African-American samples analyzed with a more limited set 
of markers (Jackson MI, and the Low Country counties, South Carolina, Pfaff et al., supra, 
2001) indicated that the best model to describe the admixture process in these populations 
was the continuous gene flow model. Additional support for this model came from the strong 
correlation between D0 (the initial expected association between markers originated by the 
admixture process) and Dt (the current association between markers). As shown using 
computer simulations (Pfaff et al., supra, 2001), in populations following the continuous gene 
flow model, positive correlations between D0 and Dt are expected and, in fact, this result was 
observed in African-American populations. In populations following the hybrid isolation 
model, significant correlations between D0 and Dt are not expected. 
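
The contrast between the two models can be made concrete with a minimal sketch. It assumes the standard two-population expression for the association created by a single admixture event, namely the product of the admixture proportions and the allele frequency differences at the two markers, and the standard per-generation decay by the recombination fraction in the absence of further gene flow. The function names and parameter values are hypothetical, and the sketch is not asserted to be the exact computation used in the cited simulations.

```python
def initial_admixture_ld(m, p_a_pop1, p_a_pop2, p_b_pop1, p_b_pop2):
    """Expected association created by one admixture event.

    m is the contribution of parental population 1; the p_* arguments are
    allele frequencies at markers A and B in parental populations 1 and 2.
    """
    delta_a = p_a_pop1 - p_a_pop2
    delta_b = p_b_pop1 - p_b_pop2
    return m * (1.0 - m) * delta_a * delta_b


def decayed_ld(d0, recombination_fraction, generations):
    """Association remaining after random mating with no further gene flow."""
    return d0 * (1.0 - recombination_fraction) ** generations


if __name__ == "__main__":
    d0 = initial_admixture_ld(0.17, 0.9, 0.1, 0.8, 0.2)
    # Unlinked markers recombine with probability 0.5 per generation.
    for t in (0, 1, 5, 10):
        print(t, decayed_ld(d0, 0.5, t))
```

Under these assumptions the association between unlinked markers is halved each generation after a single admixture event, so little of it remains after several generations, which is consistent with no significant correlation between D0 and Dt being expected under hybrid isolation. Continuous gene flow regenerates association every generation, which is consistent with the positive correlation expected under that model.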

[0203] The Spanish-American sample from San Luis Valley that was analyzed showed 
less genetic structure than any of the African-American populations. The number of 
observed significant associations between unlinked markers was only slightly higher than 
expected at a 5% significance level (7.3% vs. 5%, Figure 5A). This result was interesting 
considering that the Spanish-American population was considerably more admixed than the 
African Americans, and under the same model of admixture dynamics, would be expected to 
show considerably more structure. The correlation of individual ancestry estimates based on 
independent markers, although significant, was much lower than the values observed in the 
African-American populations (r=0.11, p<0.0001, Figure 5B). Also, there was no correlation 
of D0 and Dt in the San Luis Valley sample, in contrast to the results observed in African 
Americans. These results demonstrate that admixture dynamics (the way in which the 
population was formed and has evolved) have been different in the African-American 
populations and the San Luis Valley population, with the former more closely resembling the 
continuous gene flow model than the latter. Of course, other Hispanic populations can show 
different patterns of admixture dynamics than that observed in San Luis Valley. 

[0204] As expected from the lower admixture levels observed in European Americans, 
there was no evidence of genetic structure due to admixture in this sample from State College 
PA (Figures 5A and 5B). The number of significant associations between unlinked markers 
was similar to the value expected by chance, and there was no correlation between individual 
ancestry estimates of independent subsets of markers (p=0.149, NS). 

[0205] These results demonstrate that the use of selected genetic markers (AIMs) allows 
an analysis of the dynamics of the admixture process and the effect of this admixture process 
on the pattern of LD in admixed populations. In admixed populations that have had an 
admixture process similar to the hybrid isolation model (initial admixture followed by 
independent evolution of the admixed population without further genetic contribution of the 
parental populations), few false positive results are expected (recalling that the falseness is 
relevant for a "gene hunter" searching for genes that cause traits through LD or linkage, not 
for those seeking to develop classification tools). In admixed populations that more closely 
resemble the continuous gene flow model (continuous genetic contribution each generation 
from one of the parental populations to the admixed population), the LD is expected to extend 
much longer distances and problems with false positive results will arise. Fortunately for the 
gene hunter, the information conveyed by the AIMs can control for genetic structure and 
minimize false positives. An example is provided below demonstrating how such control can 
be achieved using appropriate statistical methods and skin pigmentation as a model 
phenotype. 

[0206] Skin pigmentation and individual ancestry were examined in an African-American 
sample and a Spanish-American sample. As previously demonstrated, the genetic structure 
created by admixture can be effectively controlled, and association due to linkage can be 
distinguished from spurious association due to genetic structure using appropriate statistical 
tests (McKeigue et al., supra, 2000). In the present study, the same methods were applied in 
a study of skin pigmentation in two admixed samples (African Americans from Washington 
DC and Spanish Americans from San Luis Valley). Information on skin pigmentation was 
collected for each individual in both studies, and the subjects were genotyped for a panel of 
AIMs and individual ancestry proportions were calculated using the maximum likelihood 
method (Chakraborty et al., supra, 1986). Individual ancestry (% African or % Native 
American) was plotted against melanin index (African) or skin reflectance (Native American) 
for each individual. Several of the AIMs showing high differences in frequency between the 
parental populations were also candidate genes for pigmentation. 
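
A maximum likelihood estimate of individual ancestry of the general kind referred to above can be sketched as follows. This is a simplified illustration, not the cited method itself: it assumes only two parental populations, unlinked biallelic markers, known parental allele frequencies, and binomial sampling of the two alleles at each marker, and it maximizes the likelihood by a simple grid search. The function name ml_ancestry and all numeric values are hypothetical.

```python
import math


def ml_ancestry(genotypes, freqs_pop1, freqs_pop2, steps=100):
    """Grid-search ML estimate of the ancestry fraction from population 1.

    genotypes[k] is the number of copies (0, 1 or 2) of the reference allele
    at marker k; freqs_pop1[k] and freqs_pop2[k] are the frequencies of that
    allele in the two parental populations.
    """
    eps = 1e-9  # keeps log() finite when an allele is fixed in a population
    best_m, best_loglik = 0.0, -math.inf
    for i in range(steps + 1):
        m = i / steps
        loglik = 0.0
        for g, p1, p2 in zip(genotypes, freqs_pop1, freqs_pop2):
            # Expected allele frequency for an individual with ancestry m.
            q = min(max(m * p1 + (1.0 - m) * p2, eps), 1.0 - eps)
            loglik += (math.log(math.comb(2, g))
                       + g * math.log(q)
                       + (2 - g) * math.log(1.0 - q))
        if loglik > best_loglik:
            best_m, best_loglik = m, loglik
    return best_m


if __name__ == "__main__":
    genotypes = [2, 1, 2, 0, 1]
    pop1 = [0.9, 0.8, 0.95, 0.2, 0.7]
    pop2 = [0.1, 0.2, 0.05, 0.8, 0.1]
    print(ml_ancestry(genotypes, pop1, pop2))
```

With only a few markers the likelihood surface is flat, so the estimate is imprecise, in keeping with the relatively high standard error noted for small marker panels. A three-population analysis would require a search over two free proportions rather than one.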

[0207] In the African-American sample, a strong and highly significant correlation 
(R²=0.1879, p<0.0001) was observed between individual ancestry and the melanin index, 
which measures the melanin content of the skin. Individuals with darker skin had, on 
average, higher levels of African ancestry. The individual ancestry estimates were based on 
21 markers, and, therefore, subject to a relatively high variance, thus explaining at least some 
of the dispersion observed in the graph. An interesting feature of these results was the 
evident decrease in variance observed in moving from the right (more African ancestry) to 
the left (more European ancestry). This result is consistent with the higher level of variability 
in skin color that is found in African populations as compared to Europeans. The high 
correlation observed between individual ancestry and skin pigmentation can be due to the 
population structure typical of African-American populations (as discussed above), and 
related to the limited number of genes that were used for determining the parental population 
differences contained within this relationship. 

[0208] A similar plot was prepared for the San Luis Valley sample. Individual ancestry 
estimates using 15 Native American/European AIMs were plotted against pigmentation level 
as measured by the percent of light reflected through the PHOTOVOLT 670 Green filter. 
Because skin pigmentation was measured in different ways (absorbance vs. reflectance) in 
these two studies, the trends observed when graphed are reversed. In the Spanish-American 
sample, the correlation between individual ancestry and skin color also was significant 
(R²=0.0481, p<0.001), but lower than in the African-American sample, possibly due to the 
reduced genetic structure present in this sample. 

[0209] Tests for differences in the average pigment levels by genotype for the AIMs typed 
in the African-American population sample discussed above were performed. The panel of 
AIMs included three candidate gene markers, OCA2, TYR, and MC1R. The analysis was 
performed in three alternative ways: first with no consideration of the individual ancestry 
estimates (ANOVA); second after conditioning to control for the effect of individual ancestry 
leaving out the locus under consideration (ANCOVA/IAE minus marker); and third using the 
complete individual ancestry estimate for the conditioning (ANCOVA/IAE). As shown in 
Table 2, eight of the twenty-one markers (38%) showed significant differences (p < 0.05) 
among the three genotypes, including two of the four candidate gene markers (OCA2 and 
TYR). When using an alpha level of 0.05, only 5% of the markers tested were expected to 
yield significant results. As such, the finding that 38% of the markers showed significant differences indicates that 
population structure is related to both ancestry and pigmentation (Pfaff et al., supra, 2001, 
Parra et al., supra, 2001). 

[0210] One way to remove the effects of population structure is to test for differences 
conditioning on the individual ancestry estimates (IAE). When the complete IAE was used to 
condition (ANCOVA/IAE), only one locus showed significant average differences among 
genotypes, OCA2, the human P gene. When a less conservative conditioning approach was 
taken, in which the locus under consideration was left out of the individual ancestry estimate 
(ANCOVA/IAE minus marker), there were four significant results: OCA2, TYR, FY, and 
SGC30055. 
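
The difference between the unconditioned test (ANOVA) and the test conditioned on individual ancestry (ANCOVA/IAE) can be illustrated with a minimal sketch that compares nested least-squares models. It is an assumption-laden illustration rather than the analysis actually performed: genotype is coded as a three-level factor, ancestry enters as a single linear covariate, only the F statistic is computed, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def residual_sum_of_squares(design, y):
    """Fit ordinary least squares via the normal equations; return the RSS."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(design, y))


def genotype_f_statistic(genotypes, phenotype, ancestry=None):
    """F statistic (2 numerator df) for the genotype effect on the phenotype.

    With ancestry=None this is a one-way ANOVA across the three genotypes;
    with ancestry given, the individual ancestry estimate is included as a
    covariate in both the reduced and the full model (ANCOVA).
    """
    n = len(phenotype)
    covariate = [[a] for a in ancestry] if ancestry is not None else [[] for _ in range(n)]
    reduced = [[1.0] + cov for cov in covariate]
    # Genotype 0 is the baseline; genotypes 1 and 2 get indicator columns.
    full = [row + [float(g == 1), float(g == 2)] for row, g in zip(reduced, genotypes)]
    rss_reduced = residual_sum_of_squares(reduced, phenotype)
    rss_full = residual_sum_of_squares(full, phenotype)
    df_residual = n - len(full[0])
    return ((rss_reduced - rss_full) / 2.0) / (rss_full / df_residual)
```

When a marker is associated with the phenotype only because both track ancestry, the unconditioned F statistic is large while the conditioned one shrinks toward its null expectation, which is the behavior described above for most of the markers that were significant by ANOVA alone.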

[0211] A Bayesian full probability model for admixture and marker genotypes was also 
set up (McKeigue et al., supra, 2000). Score tests for linkage were based on testing for an 
independent association of pigmentation with number of alleles of European ancestry at each 
locus, one at a time, in a regression model that includes individual ancestry (estimated from 
marker data). The 1-sided probabilities for the score tests are shown in Table 2, where three 
loci showed evidence of linkage to skin pigmentation at an alpha level of 0.05: OCA2 
(p = 0.005), AT3 (p = 0.027), and TYR (p = 0.033). To confirm these results, other markers 
informative for ancestry in OCA2 were identified and will be analyzed by the score test 
method. The concordance between these ANOVA results and the Bayesian admixture 
mapping results was encouraging, and both methods will benefit from the addition of new 
unlinked AIMs, which will increase the precision of the individual ancestry estimates. 

[0212] The Spanish-American sample from the San Luis Valley, CO, also was analyzed 
for linkage and association using the Bayesian and ANOVA methods (Table 3). This 
analysis included 442 individuals who were typed for 15 marker loci informative for ancestry 
(2 SNPs in the DRD2 gene treated as one locus). The CYP19E2 marker (located near 
MYO5A, a pigmentation candidate gene) showed strong evidence for linkage with the ethnic 
difference in skin pigmentation. However, this result should be interpreted with caution 
because, unless several closely linked markers informative for ancestry are used, the test for 
linkage is not robust to misspecification of ancestry-specific allele frequencies. SNPs around 
the MYO5A gene can be analyzed to confirm these preliminary results. 




